I. Preprocessing Pipeline¶
- Load Historical Market Data
import yfinance as yf
import pandas as pd
import numpy as np
ticker = "^GSPC"
df = yf.download(ticker, start="2015-01-01", end="2025-01-01")
df.head()
| Price | Close | High | Low | Open | Volume |
|---|---|---|---|---|---|
| Ticker | ^GSPC | ^GSPC | ^GSPC | ^GSPC | ^GSPC |
| Date | |||||
| 2015-01-02 | 2058.199951 | 2072.360107 | 2046.040039 | 2058.899902 | 2708700000 |
| 2015-01-05 | 2020.579956 | 2054.439941 | 2017.339966 | 2054.439941 | 3799120000 |
| 2015-01-06 | 2002.609985 | 2030.250000 | 1992.439941 | 2022.150024 | 4460110000 |
| 2015-01-07 | 2025.900024 | 2029.609985 | 2005.550049 | 2005.550049 | 3805480000 |
| 2015-01-08 | 2062.139893 | 2064.080078 | 2030.609985 | 2030.609985 | 3934010000 |
Source: Yahoo Finance via yfinance API
Ticker Used: ^GSPC (S&P 500 Index), representing broad market trends
Date Range: January 1, 2015 – December 31, 2024
Columns: Open, High, Low, Close, Volume
Saving Original Dataset
raw_df = yf.download("^GSPC", start="2015-01-01", end="2025-01-01")
raw_df.to_csv("project_dataset_sp500_raw.csv")
- Compute Log Returns
Log returns are closer to stationary than raw prices and are standard in financial modeling.
Formula: log(Close_t / Close_t-1)
# Flatten yfinance's MultiIndex columns to a single level
df.columns = df.columns.get_level_values(0)
df["LogReturn"] = np.log(df["Close"] / df["Close"].shift(1))
df.head(8)
| Price | Close | High | Low | Open | Volume | LogReturn |
|---|---|---|---|---|---|---|
| Date | ||||||
| 2015-01-02 | 2058.199951 | 2072.360107 | 2046.040039 | 2058.899902 | 2708700000 | NaN |
| 2015-01-05 | 2020.579956 | 2054.439941 | 2017.339966 | 2054.439941 | 3799120000 | -0.018447 |
| 2015-01-06 | 2002.609985 | 2030.250000 | 1992.439941 | 2022.150024 | 4460110000 | -0.008933 |
| 2015-01-07 | 2025.900024 | 2029.609985 | 2005.550049 | 2005.550049 | 3805480000 | 0.011563 |
| 2015-01-08 | 2062.139893 | 2064.080078 | 2030.609985 | 2030.609985 | 3934010000 | 0.017730 |
| 2015-01-09 | 2044.810059 | 2064.429932 | 2038.329956 | 2063.449951 | 3364140000 | -0.008439 |
| 2015-01-12 | 2028.260010 | 2049.300049 | 2022.579956 | 2046.130005 | 3456460000 | -0.008127 |
| 2015-01-13 | 2023.030029 | 2056.929932 | 2008.250000 | 2031.579956 | 4107300000 | -0.002582 |
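A convenient property of log returns (one reason they are preferred over simple returns) is additivity: daily log returns summed over a window equal the log return of the whole window. A minimal sketch with hypothetical prices, not S&P 500 data:

```python
import numpy as np

# Hypothetical closing prices
close = np.array([100.0, 102.0, 101.0, 105.0])

# Daily log returns: log(Close_t / Close_{t-1})
log_returns = np.log(close[1:] / close[:-1])

# Additivity: the sum of daily log returns equals the full-period log return
total = np.log(close[-1] / close[0])
assert np.isclose(log_returns.sum(), total)
print(log_returns.sum())  # equals log(105/100)
```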
- Compute Technical Indicators
These features provide the model with signals on trend, momentum, and potential turning points.
a. Relative Strength Index (RSI) — Momentum Indicator
- Measures recent gain/loss to detect overbought/oversold conditions.
def compute_rsi(series, window=14):
    delta = series.diff()
    gain = delta.clip(lower=0).rolling(window).mean()
    loss = -delta.clip(upper=0).rolling(window).mean()
    rs = gain / loss
    return 100 - (100 / (1 + rs))
df["RSI"] = compute_rsi(df["Close"])
df.head(20)
| Price | Close | High | Low | Open | Volume | LogReturn | RSI |
|---|---|---|---|---|---|---|---|
| Date | |||||||
| 2015-01-02 | 2058.199951 | 2072.360107 | 2046.040039 | 2058.899902 | 2708700000 | NaN | NaN |
| 2015-01-05 | 2020.579956 | 2054.439941 | 2017.339966 | 2054.439941 | 3799120000 | -0.018447 | NaN |
| 2015-01-06 | 2002.609985 | 2030.250000 | 1992.439941 | 2022.150024 | 4460110000 | -0.008933 | NaN |
| 2015-01-07 | 2025.900024 | 2029.609985 | 2005.550049 | 2005.550049 | 3805480000 | 0.011563 | NaN |
| 2015-01-08 | 2062.139893 | 2064.080078 | 2030.609985 | 2030.609985 | 3934010000 | 0.017730 | NaN |
| 2015-01-09 | 2044.810059 | 2064.429932 | 2038.329956 | 2063.449951 | 3364140000 | -0.008439 | NaN |
| 2015-01-12 | 2028.260010 | 2049.300049 | 2022.579956 | 2046.130005 | 3456460000 | -0.008127 | NaN |
| 2015-01-13 | 2023.030029 | 2056.929932 | 2008.250000 | 2031.579956 | 4107300000 | -0.002582 | NaN |
| 2015-01-14 | 2011.270020 | 2018.400024 | 1988.439941 | 2018.400024 | 4378680000 | -0.005830 | NaN |
| 2015-01-15 | 1992.670044 | 2021.349976 | 1991.469971 | 2013.750000 | 4276720000 | -0.009291 | NaN |
| 2015-01-16 | 2019.420044 | 2020.459961 | 1988.119995 | 1992.250000 | 4056410000 | 0.013335 | NaN |
| 2015-01-20 | 2022.550049 | 2028.939941 | 2004.489990 | 2020.760010 | 3944340000 | 0.001549 | NaN |
| 2015-01-21 | 2032.119995 | 2038.290039 | 2012.040039 | 2020.189941 | 3730070000 | 0.004720 | NaN |
| 2015-01-22 | 2063.149902 | 2064.620117 | 2026.380005 | 2034.300049 | 4176050000 | 0.015154 | NaN |
| 2015-01-23 | 2051.820068 | 2062.979980 | 2050.540039 | 2062.979980 | 3573560000 | -0.005507 | 48.802572 |
| 2015-01-26 | 2057.090088 | 2057.620117 | 2040.969971 | 2050.419922 | 3465760000 | 0.002565 | 57.799662 |
| 2015-01-27 | 2029.550049 | 2047.859985 | 2019.910034 | 2047.859985 | 3329810000 | -0.013478 | 55.529127 |
| 2015-01-28 | 2002.160034 | 2042.489990 | 2001.489990 | 2032.339966 | 4067530000 | -0.013588 | 45.208292 |
| 2015-01-29 | 2021.250000 | 2024.640015 | 1989.180054 | 2002.449951 | 4127140000 | 0.009490 | 41.132852 |
| 2015-01-30 | 1994.989990 | 2023.319946 | 1993.380005 | 2019.349976 | 4568650000 | -0.013077 | 39.599140 |
b. MACD (Moving Average Convergence Divergence) — Trend Change Indicator
- Difference between fast and slow EMAs of the closing price
def compute_macd(series, fast=12, slow=26, signal=9):
    ema_fast = series.ewm(span=fast, adjust=False).mean()
    ema_slow = series.ewm(span=slow, adjust=False).mean()
    macd = ema_fast - ema_slow
    signal_line = macd.ewm(span=signal, adjust=False).mean()
    return macd, signal_line
df["MACD"], df["MACD_Signal"] = compute_macd(df["Close"])
df.head(20)
| Price | Close | High | Low | Open | Volume | LogReturn | RSI | MACD | MACD_Signal |
|---|---|---|---|---|---|---|---|---|---|
| Date | |||||||||
| 2015-01-02 | 2058.199951 | 2072.360107 | 2046.040039 | 2058.899902 | 2708700000 | NaN | NaN | 0.000000 | 0.000000 |
| 2015-01-05 | 2020.579956 | 2054.439941 | 2017.339966 | 2054.439941 | 3799120000 | -0.018447 | NaN | -3.001025 | -0.600205 |
| 2015-01-06 | 2002.609985 | 2030.250000 | 1992.439941 | 2022.150024 | 4460110000 | -0.008933 | NaN | -6.751558 | -1.830476 |
| 2015-01-07 | 2025.900024 | 2029.609985 | 2005.550049 | 2005.550049 | 3805480000 | 0.011563 | NaN | -7.755174 | -3.015415 |
| 2015-01-08 | 2062.139893 | 2064.080078 | 2030.609985 | 2030.609985 | 3934010000 | 0.017730 | NaN | -5.562175 | -3.524767 |
| 2015-01-09 | 2044.810059 | 2064.429932 | 2038.329956 | 2063.449951 | 3364140000 | -0.008439 | NaN | -5.163064 | -3.852427 |
| 2015-01-12 | 2028.260010 | 2049.300049 | 2022.579956 | 2046.130005 | 3456460000 | -0.008127 | NaN | -6.111763 | -4.304294 |
| 2015-01-13 | 2023.030029 | 2056.929932 | 2008.250000 | 2031.579956 | 4107300000 | -0.002582 | NaN | -7.202603 | -4.883956 |
| 2015-01-14 | 2011.270020 | 2018.400024 | 1988.439941 | 2018.400024 | 4378680000 | -0.005830 | NaN | -8.913289 | -5.689822 |
| 2015-01-15 | 1992.670044 | 2021.349976 | 1991.469971 | 2013.750000 | 4276720000 | -0.009291 | NaN | -11.635753 | -6.879009 |
| 2015-01-16 | 2019.420044 | 2020.459961 | 1988.119995 | 1992.250000 | 4056410000 | 0.013335 | NaN | -11.502233 | -7.803654 |
| 2015-01-20 | 2022.550049 | 2028.939941 | 2004.489990 | 2020.760010 | 3944340000 | 0.001549 | NaN | -11.016857 | -8.446294 |
| 2015-01-21 | 2032.119995 | 2038.290039 | 2012.040039 | 2020.189941 | 3730070000 | 0.004720 | NaN | -9.747614 | -8.706558 |
| 2015-01-22 | 2063.149902 | 2064.620117 | 2026.380005 | 2034.300049 | 4176050000 | 0.015154 | NaN | -6.166789 | -8.198604 |
| 2015-01-23 | 2051.820068 | 2062.979980 | 2050.540039 | 2062.979980 | 3573560000 | -0.005507 | 48.802572 | -4.194826 | -7.397849 |
| 2015-01-26 | 2057.090088 | 2057.620117 | 2040.969971 | 2050.419922 | 3465760000 | 0.002565 | 57.799662 | -2.181637 | -6.354606 |
| 2015-01-27 | 2029.550049 | 2047.859985 | 2019.910034 | 2047.859985 | 3329810000 | -0.013478 | 55.529127 | -2.776416 | -5.638968 |
| 2015-01-28 | 2002.160034 | 2042.489990 | 2001.489990 | 2032.339966 | 4067530000 | -0.013588 | 45.208292 | -5.395729 | -5.590320 |
| 2015-01-29 | 2021.250000 | 2024.640015 | 1989.180054 | 2002.449951 | 4127140000 | 0.009490 | 41.132852 | -5.863561 | -5.644969 |
| 2015-01-30 | 1994.989990 | 2023.319946 | 1993.380005 | 2019.349976 | 4568650000 | -0.013077 | 39.599140 | -8.258091 | -6.167593 |
- Handle Missing Values
- Drop rows with NaN values due to rolling indicators or first-day return computation.
df.dropna(inplace=True)
df.head()
| Price | Close | High | Low | Open | Volume | LogReturn | RSI | MACD | MACD_Signal |
|---|---|---|---|---|---|---|---|---|---|
| Date | |||||||||
| 2015-01-23 | 2051.820068 | 2062.979980 | 2050.540039 | 2062.979980 | 3573560000 | -0.005507 | 48.802572 | -4.194826 | -7.397849 |
| 2015-01-26 | 2057.090088 | 2057.620117 | 2040.969971 | 2050.419922 | 3465760000 | 0.002565 | 57.799662 | -2.181637 | -6.354606 |
| 2015-01-27 | 2029.550049 | 2047.859985 | 2019.910034 | 2047.859985 | 3329810000 | -0.013478 | 55.529127 | -2.776416 | -5.638968 |
| 2015-01-28 | 2002.160034 | 2042.489990 | 2001.489990 | 2032.339966 | 4067530000 | -0.013588 | 45.208292 | -5.395729 | -5.590320 |
| 2015-01-29 | 2021.250000 | 2024.640015 | 1989.180054 | 2002.449951 | 4127140000 | 0.009490 | 41.132852 | -5.863561 | -5.644969 |
- Save the Processed Dataset
df.to_csv("project_dataset_sp500_processed.csv", index=True)
II. Feature Normalization¶
Apply standardization to features like RSI, MACD, Volume, etc.
This step improves neural network convergence.
from sklearn.preprocessing import StandardScaler
# Copy the original DataFrame
df_scaled = df.copy()
# Select the columns to normalize
feature_cols = ["LogReturn", "RSI", "MACD", "MACD_Signal", "Volume"]
# Fit and transform
scaler = StandardScaler()
df_scaled[feature_cols] = scaler.fit_transform(df_scaled[feature_cols])
df_scaled[["LogReturn", "RSI", "MACD", "MACD_Signal", "Volume"]].head()
| Price | LogReturn | RSI | MACD | MACD_Signal | Volume |
|---|---|---|---|---|---|
| Date | |||||
| 2015-01-23 | -0.525811 | -0.487923 | -0.414165 | -0.537220 | -0.451163 |
| 2015-01-26 | 0.190475 | 0.072745 | -0.359326 | -0.506722 | -0.563553 |
| 2015-01-27 | -1.233205 | -0.068747 | -0.375528 | -0.485801 | -0.705291 |
| 2015-01-28 | -1.242897 | -0.711906 | -0.446877 | -0.484379 | 0.063839 |
| 2015-01-29 | 0.804935 | -0.965874 | -0.459621 | -0.485977 | 0.125987 |
Save the Cleaned & Scaled Dataset
df_scaled.to_csv("project_dataset_sp500_processed_cleaned_scaled.csv", index=True)
III. Exploratory Data Analysis & Visualization¶
- Distribution of Target Variable
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(8, 4))
sns.histplot(df_scaled["LogReturn"], bins=50, kde=True, color='skyblue')
plt.title("Distribution of Log Returns")
plt.xlabel("Log Return")
plt.ylabel("Frequency")
plt.grid(True)
plt.show()
This histogram shows the distribution of daily log returns. The distribution is sharply peaked around 0 and exhibits fat tails, reflecting the presence of extreme market movements (e.g., crashes or rallies). Such behavior aligns with known stylized facts in financial econometrics and motivates the need for uncertainty-aware forecasting.
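The fat tails can be quantified with excess kurtosis, which is approximately 0 for a normal distribution and positive for heavy-tailed ones. A self-contained sketch on synthetic data (not the actual return series):

```python
import numpy as np

def excess_kurtosis(x):
    """Sample excess kurtosis: 4th central moment / variance^2, minus 3."""
    x = np.asarray(x, dtype=float)
    z = x - x.mean()
    return (z**4).mean() / (z**2).mean()**2 - 3.0

rng = np.random.default_rng(42)
normal = rng.normal(size=100_000)
heavy = rng.standard_t(df=3, size=100_000)  # Student-t: heavier tails

print(round(excess_kurtosis(normal), 2))  # near 0
print(round(excess_kurtosis(heavy), 2))   # clearly positive
```

Applying the same function to `df_scaled["LogReturn"]` would make the fat-tail claim concrete.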
- Correlation Between Features
plt.figure(figsize=(8, 6))
sns.heatmap(df_scaled[feature_cols].corr(), annot=True, cmap='coolwarm')
plt.title("Correlation Heatmap of Input Features")
plt.show()
This heatmap shows Pearson correlations between engineered features. MACD and its signal are highly correlated as expected. RSI shows moderate correlation with trend indicators. Volume is weakly inversely related to other features and may contribute to detecting volatility regimes.
- Effect of Feature Scaling
plt.figure(figsize=(10, 4))
plt.plot(df["RSI"].values, label="Raw RSI", alpha=0.5)
plt.plot(df_scaled["RSI"].values, label="Scaled RSI", alpha=0.8)
plt.legend()
plt.title("Raw vs Scaled RSI")
plt.xlabel("Time Steps")
plt.ylabel("RSI Value")
plt.show()
The plot compares the original RSI values with their normalized counterparts using standardization. Although the scale is compressed, the overall trend and pattern are preserved. This confirms that scaling does not distort signal information but helps stabilize neural network training.
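Because standardization is a linear map, the scaled series correlates perfectly with the raw one, which is what "preserves signal information" means here. A quick check with a synthetic stand-in for the RSI series:

```python
import numpy as np

rng = np.random.default_rng(1)
raw = rng.uniform(20, 80, size=500)      # stand-in for a raw RSI series

scaled = (raw - raw.mean()) / raw.std()  # z-score standardization

# Pearson correlation between raw and scaled values is exactly 1
print(np.corrcoef(raw, scaled)[0, 1])
```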
- Log Returns Over Time
plt.figure(figsize=(12, 4))
plt.plot(df.index, df["LogReturn"], label="Log Return")
plt.axhline(0, linestyle='--', color='gray', alpha=0.6)
plt.title("Daily Log Returns Over Time")
plt.xlabel("Date")
plt.ylabel("Log Return")
plt.grid(True)
plt.legend()
plt.show()
This plot shows the daily log returns over the entire dataset period. A large spike in early 2020 corresponds to the COVID-19 market crash and rebound. This confirms the presence of high-volatility regimes and validates the importance of including such periods for robustness testing of forecasting models.
IV. Sliding Window Construction¶
- Target: Next-day log return or close price
- Input: N-day sequences of features
Create Sliding Window Dataset
import numpy as np
def create_sliding_window(X, y, window_size=30):
    Xs, ys = [], []
    for i in range(len(X) - window_size):
        Xs.append(X[i:i + window_size])
        ys.append(y[i + window_size])
    return np.array(Xs), np.array(ys)
# Define input and output
features = df_scaled[feature_cols].values
target = df_scaled["LogReturn"].values
# Create sliding window dataset
X, y = create_sliding_window(features, target, window_size=30)
# Check shapes
print("X shape:", X.shape) # (samples, window_size, features)
print("y shape:", y.shape) # (samples,)
X shape: (2472, 30, 5)
y shape: (2472,)
Data Splitting (Walk-Forward Split)¶
# 70% train, 15% val, 15% test
n = len(X)
train_end = int(n * 0.7)
val_end = int(n * 0.85)
X_train, y_train = X[:train_end], y[:train_end]
X_val, y_val = X[train_end:val_end], y[train_end:val_end]
X_test, y_test = X[val_end:], y[val_end:]
print("Train:", X_train.shape, y_train.shape)
print("Val: ", X_val.shape, y_val.shape)
print("Test: ", X_test.shape, y_test.shape)
Train: (1730, 30, 5) (1730,)
Val:   (371, 30, 5) (371,)
Test:  (371, 30, 5) (371,)
Walk-Forward Split Strategy¶
For time series forecasting, I use a walk-forward split instead of random shuffling. This preserves the temporal order of the data and mimics real-world forecasting, where the model only has access to past observations.
Why Use Walk-Forward Splitting?¶
- Prevents lookahead bias
- Ensures model evaluation simulates true deployment
- Maintains temporal causality (training on past, testing on future)
- Allows assessment of robustness across market regimes (e.g., COVID-2020 → post-COVID recovery → inflation era)
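The notebook uses a single chronological 70/15/15 split; a fuller walk-forward evaluation retrains on an expanding window and tests on each subsequent block. A minimal index-generating sketch (a hypothetical helper, not code used above):

```python
def walk_forward_splits(n_samples, n_folds=4, min_train=100):
    """Yield (train_end, test_start, test_end) for expanding-window folds.

    Each fold trains on [0, train_end) and tests on [test_start, test_end),
    so the model never sees data from its own future.
    """
    fold_size = (n_samples - min_train) // n_folds
    for k in range(n_folds):
        train_end = min_train + k * fold_size
        test_end = min(train_end + fold_size, n_samples)
        yield train_end, train_end, test_end

for train_end, test_start, test_end in walk_forward_splits(500, n_folds=4):
    print(f"train [0, {train_end})  test [{test_start}, {test_end})")
```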
Our Split:¶
- Training Set (70%): 2015–2021 (approx.)
- Validation Set (15%): 2022
- Test Set (15%): 2023–2024
V. Baseline Model: LSTM for Log Return Forecasting¶
- Define the LSTM Architecture
This model implements a single-layer LSTM network with 64 hidden units, followed by a ReLU activation and two fully connected layers. The model outputs a single value representing the next day's log return. This forms the baseline for comparison with GRU and Transformer models in later sections.
import torch
import torch.nn as nn
class LSTMRegressor(nn.Module):
    def __init__(self, input_size, hidden_size=64, num_layers=1, dropout=0.5):
        super(LSTMRegressor, self).__init__()
        # LSTM layer; note: nn.LSTM applies dropout only between stacked layers,
        # so with num_layers=1 this dropout argument has no effect
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers,
                            batch_first=True, dropout=dropout)
        # Fully connected layers for regression
        self.fc1 = nn.Linear(hidden_size, 32)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(32, 1)  # Predict a single log return

    def forward(self, x):
        out, _ = self.lstm(x)  # out shape: [batch_size, seq_len, hidden_size]
        out = out[:, -1, :]    # Select the output of the last time step
        out = self.relu(self.fc1(out))
        return self.fc2(out)
Instantiate
input_size = X_train.shape[2] # 5 features
model = LSTMRegressor(input_size)
- Loss Function and Optimizer
The model uses Mean Squared Error (MSE) as the loss function for log return regression. The Adam optimizer is used for training.
import torch.optim as optim
# Select device (GPU if available)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Move model to device
model = model.to(device)
# Loss function: MSE for regression
criterion = nn.MSELoss()
# Optimizer: Adam
optimizer = optim.Adam(model.parameters(), lr=1e-3)
- Train and Validate
Convert to PyTorch Dataset and DataLoader
from torch.utils.data import TensorDataset, DataLoader
# Convert to tensors
X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train, dtype=torch.float32)
X_val_tensor = torch.tensor(X_val, dtype=torch.float32)
y_val_tensor = torch.tensor(y_val, dtype=torch.float32)
# Create datasets and loaders
train_ds = TensorDataset(X_train_tensor, y_train_tensor)
val_ds = TensorDataset(X_val_tensor, y_val_tensor)
train_loader = DataLoader(train_ds, batch_size=32, shuffle=False)
val_loader = DataLoader(val_ds, batch_size=32, shuffle=False)
The training and validation sets are wrapped in PyTorch TensorDataset objects and served through DataLoaders. Shuffling is disabled to preserve temporal order during training. Each sample is a 30-day sequence of features, and the target is the next day's log return.
Define Training Loop
def train_model(model, train_loader, val_loader, criterion, optimizer, epochs=50):
    model.to(device)
    train_losses, val_losses = [], []
    for epoch in range(epochs):
        model.train()
        epoch_train_loss = 0.0
        for xb, yb in train_loader:
            xb, yb = xb.to(device), yb.to(device)
            optimizer.zero_grad()
            preds = model(xb).squeeze()
            loss = criterion(preds, yb)
            loss.backward()
            optimizer.step()
            epoch_train_loss += loss.item()
        train_loss = epoch_train_loss / len(train_loader)
        train_losses.append(train_loss)
        # Evaluate on validation set
        model.eval()
        with torch.no_grad():
            val_loss = sum(
                criterion(model(xb.to(device)).squeeze(), yb.to(device)).item()
                for xb, yb in val_loader
            ) / len(val_loader)
        val_losses.append(val_loss)
        print(f"Epoch {epoch+1}/{epochs} | Train Loss: {train_loss:.4f} | Val Loss: {val_loss:.4f}")
    return model, train_losses, val_losses
The model is trained using a custom loop over mini-batches. Mean Squared Error (MSE) is computed for both training and validation sets at each epoch. Temporal order is preserved by not shuffling data during loading.
Train the Model
# Train the model using the function
trained_model, train_losses, val_losses = train_model(
model=model,
train_loader=train_loader,
val_loader=val_loader,
criterion=criterion,
optimizer=optimizer,
epochs=50
)
Epoch 1/50  | Train Loss: 1.0350 | Val Loss: 1.4027
Epoch 2/50  | Train Loss: 1.0276 | Val Loss: 1.4016
Epoch 3/50  | Train Loss: 1.0203 | Val Loss: 1.4030
...
Epoch 19/50 | Train Loss: 0.8214 | Val Loss: 1.4825
Epoch 20/50 | Train Loss: 0.7626 | Val Loss: 1.6418
Epoch 21/50 | Train Loss: 0.7114 | Val Loss: 1.6559
...
Epoch 48/50 | Train Loss: 0.4767 | Val Loss: 2.1653
Epoch 49/50 | Train Loss: 0.4636 | Val Loss: 2.2164
Epoch 50/50 | Train Loss: 0.4780 | Val Loss: 2.1549
Save training details
import pickle
# Save training history
with open("lstm_training_history.pkl", "wb") as f:
    pickle.dump({"train_losses": train_losses, "val_losses": val_losses}, f)
Plot Loss Curves
import matplotlib.pyplot as plt
plt.plot(train_losses, label="Train Loss")
plt.plot(val_losses, label="Val Loss")
plt.xlabel("Epoch")
plt.ylabel("MSE Loss")
plt.title("Training and Validation Loss Over Epochs")
plt.legend()
plt.grid(True)
plt.show()
This plot shows the mean squared error (MSE) loss over 50 epochs for the baseline LSTM model. While the training loss steadily decreases, the validation loss begins to rise after approximately 20 epochs, indicating early signs of overfitting. This suggests that the model is learning the training data well but may be losing generalization capability over time.
In future phases, techniques such as early stopping, dropout tuning, or learning rate scheduling may help mitigate overfitting and improve performance on unseen data.
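Early stopping can be sketched as a patience counter on the validation loss; the helper below is a hypothetical illustration (the `EarlyStopper` class is not part of the notebook) that could be called once per epoch inside `train_model`:

```python
class EarlyStopper:
    """Stop training when validation loss has not improved for `patience` epochs."""
    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.counter = 0

    def step(self, val_loss):
        """Return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.counter = 0
        else:
            self.counter += 1
        return self.counter >= self.patience

# Replaying the validation-curve shape seen above: brief improvement, then drift upward
stopper = EarlyStopper(patience=3)
for epoch, vl in enumerate([1.40, 1.40, 1.41, 1.45, 1.42, 1.72, 1.91], start=1):
    if stopper.step(vl):
        print(f"stop at epoch {epoch}")
        break
```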
- Evaluate on Test Set
Prepare Test Data
# Convert test data to tensors
X_test_tensor = torch.tensor(X_test, dtype=torch.float32).to(device)
y_test_tensor = torch.tensor(y_test, dtype=torch.float32).to(device)
Predict and Compare
# Put model in eval mode
model.eval()
# Disable gradient tracking
with torch.no_grad():
    y_pred_tensor = model(X_test_tensor).squeeze()
# Convert to cpu numpy arrays
y_true = y_test_tensor.detach().cpu().numpy()
y_pred_lstm = y_pred_tensor.detach().cpu().numpy()
Compute Metrics
from sklearn.metrics import mean_squared_error, mean_absolute_error
import numpy as np
mse = mean_squared_error(y_true, y_pred_lstm)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_true, y_pred_lstm)
print(f"LSTM Test MSE : {mse:.4f}")
print(f"LSTM Test RMSE: {rmse:.4f}")
print(f"LSTM Test MAE : {mae:.4f}")
LSTM Test MSE : 0.5669
LSTM Test RMSE: 0.7529
LSTM Test MAE : 0.5677
The baseline LSTM model was evaluated on the final 15% of the dataset using MSE, RMSE, and MAE metrics:
- Test MSE: 0.5669
- Test RMSE: 0.7529
- Test MAE: 0.5677
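These error values are easier to interpret against a naive reference. For standardized log returns, a common baseline is the constant zero (mean) forecast, whose MSE equals the mean squared target, close to 1 after standardization. A sketch with synthetic stand-in data (not the real `y_test`):

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """MSE, RMSE, and MAE for a regression forecast."""
    err = np.asarray(y_true) - np.asarray(y_pred)
    mse = float((err**2).mean())
    return {"mse": mse, "rmse": mse**0.5, "mae": float(np.abs(err).mean())}

# Synthetic stand-in for standardized test-set returns
rng = np.random.default_rng(7)
y_true = rng.normal(size=371)

naive = regression_metrics(y_true, np.zeros_like(y_true))
print(naive)  # MSE of the zero forecast equals the mean squared target
```

By this yardstick, a test MSE of 0.5669 is below the naive level, though part of that gap reflects the test period being calmer than the full-sample scaling window.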
Plot Predictions vs Actual
After training, the model was evaluated on a held-out test set representing the most recent portion of the time series. Metrics such as Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and Mean Absolute Error (MAE) were computed. The plot below shows predicted vs actual log returns to visualize model performance.
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 4))
plt.plot(y_true, label="Actual", alpha=0.7)
plt.plot(y_pred_lstm, label="Predicted", alpha=0.7)
plt.title("Predicted vs Actual Log Returns on Test Set")
plt.xlabel("Time Step")
plt.ylabel("Log Return")
plt.legend()
plt.grid(True)
plt.show()
The plot shows predicted vs actual daily log returns. The model captures the general trend but underestimates extreme values, a common limitation of non-probabilistic baselines. These results will serve as a benchmark for more advanced models and uncertainty-aware extensions in future phases.
- Save the weights
# Save model weights to a .pt file
torch.save(model.state_dict(), "project_weights_lstm_baseline.pt")
VI. Baseline Model: GRU for Log Return Forecasting¶
To complement the LSTM baseline, I implemented a GRU model using the same architecture and hyperparameters. GRUs are computationally more efficient while retaining the ability to model short-to-medium-term dependencies. Evaluation metrics are computed on the same test set to provide a fair comparison with the LSTM model.
- GRU Model Definition
import torch.nn as nn
class GRURegressor(nn.Module):
    def __init__(self, input_size, hidden_size=64, num_layers=1, dropout=0.2):
        super(GRURegressor, self).__init__()
        # GRU layer; note: nn.GRU applies dropout only between stacked layers,
        # so with num_layers=1 this dropout argument has no effect
        self.gru = nn.GRU(input_size, hidden_size, num_layers,
                          batch_first=True, dropout=dropout)
        # Fully connected layers for regression
        self.fc1 = nn.Linear(hidden_size, 32)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(32, 1)

    def forward(self, x):
        out, _ = self.gru(x)  # out shape: [batch_size, seq_len, hidden_size]
        out = out[:, -1, :]   # Select the output of the last time step
        out = self.relu(self.fc1(out))
        return self.fc2(out)
- Instantiate and Compile GRU Model
# Instantiate GRU model
gru_model = GRURegressor(input_size=X_train.shape[2]).to(device)
# Define loss and optimizer
criterion = nn.MSELoss()
optimizer_gru = torch.optim.Adam(gru_model.parameters(), lr=1e-3)
- Train the GRU Model
gru_model, gru_train_losses, gru_val_losses = train_model(
model=gru_model,
train_loader=train_loader,
val_loader=val_loader,
criterion=criterion,
optimizer=optimizer_gru,
epochs=50
)
Epoch 1/50  | Train Loss: 1.0347 | Val Loss: 1.4047
Epoch 2/50  | Train Loss: 1.0236 | Val Loss: 1.4122
Epoch 3/50  | Train Loss: 1.0134 | Val Loss: 1.4159
...
Epoch 9/50  | Train Loss: 0.9030 | Val Loss: 1.5249
Epoch 10/50 | Train Loss: 0.8895 | Val Loss: 1.9763
Epoch 11/50 | Train Loss: 0.8954 | Val Loss: 1.5582
...
Epoch 48/50 | Train Loss: 0.4755 | Val Loss: 2.0962
Epoch 49/50 | Train Loss: 0.4742 | Val Loss: 2.2750
Epoch 50/50 | Train Loss: 0.4851 | Val Loss: 2.1362
Save training details
import pickle
# Save training history
with open("gru_training_history.pkl", "wb") as f:
    pickle.dump({"train_losses": gru_train_losses, "val_losses": gru_val_losses}, f)
Plot Loss Curves
import matplotlib.pyplot as plt
plt.plot(gru_train_losses, label="Train Loss")
plt.plot(gru_val_losses, label="Val Loss")
plt.xlabel("Epoch")
plt.ylabel("MSE Loss")
plt.title("Training and Validation Loss Over Epochs")
plt.legend()
plt.grid(True)
plt.show()
The GRU model also exhibits a clear decline in training loss over time, demonstrating its ability to learn patterns from the training data. However, the validation loss remains consistently higher and more volatile, with noticeable spikes starting around epoch 10. This suggests the model struggles more with generalization compared to LSTM.
The increased variance in the validation curve may indicate sensitivity to specific patterns or a lack of capacity to model more complex dependencies. Further improvements could be explored through tuning dropout rates, adjusting hidden units, or applying regularization techniques.
- Evaluate GRU on Test Set
Predict and Compare
# Put model in eval mode
gru_model.eval()
# Disable gradient tracking
with torch.no_grad():
    y_pred_tensor = gru_model(X_test_tensor).squeeze()
# Convert to cpu numpy arrays
y_true = y_test_tensor.cpu().numpy()
y_pred_gru = y_pred_tensor.cpu().numpy()
Compute Metrics
from sklearn.metrics import mean_squared_error, mean_absolute_error
import numpy as np
mse = mean_squared_error(y_true, y_pred_gru)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_true, y_pred_gru)
print(f"GRU Test MSE : {mse:.4f}")
print(f"GRU Test RMSE: {rmse:.4f}")
print(f"GRU Test MAE : {mae:.4f}")
GRU Test MSE : 0.6792
GRU Test RMSE: 0.8241
GRU Test MAE : 0.6231
The GRU model was trained using the same architecture, window size, and hyperparameters as the LSTM baseline to ensure a fair comparison. Below are the evaluation results on the same test set:
- GRU Test MSE : 0.6792
- GRU Test RMSE: 0.8241
- GRU Test MAE : 0.6231
For reference, the LSTM test metrics were:
- LSTM Test MSE: 0.5669
- LSTM Test RMSE: 0.7529
- LSTM Test MAE: 0.5677
The LSTM outperformed GRU slightly across all evaluation metrics, indicating it may be better suited to capturing the short-term dependencies in the dataset. However, the GRU still performed reasonably well and serves as a valid baseline model.
Plot Predictions vs Actual
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 4))
plt.plot(y_true, label="Actual", alpha=0.7)
plt.plot(y_pred_gru, label="Predicted", alpha=0.7)
plt.title("Predicted vs Actual Log Returns on Test Set")
plt.xlabel("Time Step")
plt.ylabel("Log Return")
plt.legend()
plt.grid(True)
plt.show()
The plot above compares the GRU model’s predicted log returns against actual values on the test set. While the GRU captures overall trends and stays relatively stable, it tends to underfit high-volatility spikes and sharp directional movements. This mirrors the behavior seen with the LSTM, though GRU exhibits slightly larger deviation in volatile regions.
Overall, the GRU provides a reasonable forecast baseline, but the LSTM shows slightly stronger alignment with real movements. This visual comparison, alongside quantitative metrics, reinforces the decision to use LSTM as the primary architecture for further tuning and uncertainty modeling.
Save the weights
# Save model weights to a .pt file
torch.save(gru_model.state_dict(), "project_weights_gru_baseline.pt")
Side-by-Side Plot: LSTM vs GRU vs Actual
import matplotlib.pyplot as plt
plt.figure(figsize=(12, 5))
plt.plot(y_true, label="Actual", linewidth=1)
plt.plot(y_pred_lstm, label="LSTM Predicted", alpha=0.8)
plt.plot(y_pred_gru, label="GRU Predicted", alpha=0.8)
plt.title("Actual vs LSTM vs GRU Log Returns on Test Set")
plt.xlabel("Time Step")
plt.ylabel("Log Return")
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
This plot overlays the predictions of both the LSTM and GRU models against the actual log returns on the test set. Visually, we observe that:
- LSTM tends to follow the direction and magnitude of large movements more closely
- GRU captures trend direction but exhibits slightly more smoothing
- Both models struggle with extreme spikes, which are inherently difficult to forecast due to noise and market shocks
This aligns well with our earlier quantitative results, where LSTM had lower RMSE and MAE. These visual and numeric insights guide us to select LSTM as the stronger baseline for further uncertainty modeling and risk estimation.
I. Modularize Data Preparation¶
def prepare_data(df_scaled, feature_cols, target_col="LogReturn", seq_len=30, split_ratio=(0.7, 0.15, 0.15)):
    # Extract input features and target
    X_raw = df_scaled[feature_cols].values
    y_raw = df_scaled[target_col].values
    # Create sliding windows
    X_seq, y_seq = create_sliding_window(X_raw, y_raw, window_size=seq_len)
    # Chronological split
    n = len(X_seq)
    train_end = int(n * split_ratio[0])
    val_end = int(n * (split_ratio[0] + split_ratio[1]))
    X_train, y_train = X_seq[:train_end], y_seq[:train_end]
    X_val, y_val = X_seq[train_end:val_end], y_seq[train_end:val_end]
    X_test, y_test = X_seq[val_end:], y_seq[val_end:]
    return X_train, y_train, X_val, y_val, X_test, y_test
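`prepare_data` depends on the `create_sliding_window` helper defined earlier in the notebook. For readers jumping in here, a minimal NumPy version consistent with the shapes assumed above (each window of `seq_len` rows predicts the target one step after the window) would look roughly like this; it is a sketch, not necessarily the exact earlier definition:

```python
import numpy as np

def create_sliding_window(X, y, window_size=30):
    # Each sample pairs a window of `window_size` consecutive rows of X
    # with the target value immediately following the window.
    X_seq, y_seq = [], []
    for i in range(len(X) - window_size):
        X_seq.append(X[i:i + window_size])
        y_seq.append(y[i + window_size])
    return np.array(X_seq), np.array(y_seq)
```

With `n` rows and a window of 30, this yields `n - 30` samples of shape `(30, n_features)`.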
II. Modularize Data Loaders¶
from torch.utils.data import TensorDataset, DataLoader
def get_data_loaders(X_train, y_train, X_val, y_val, batch_size=32):
# Wrap training and validation data into DataLoaders.
X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train, dtype=torch.float32).unsqueeze(1)
X_val_tensor = torch.tensor(X_val, dtype=torch.float32)
y_val_tensor = torch.tensor(y_val, dtype=torch.float32).unsqueeze(1)
train_ds = TensorDataset(X_train_tensor, y_train_tensor)
val_ds = TensorDataset(X_val_tensor, y_val_tensor)
train_loader = DataLoader(train_ds, batch_size=batch_size, shuffle=False)
val_loader = DataLoader(val_ds, batch_size=batch_size, shuffle=False)
return train_loader, val_loader
I. Training Loop for LSTM & GRU (Modularized)¶
import time
from torch.utils.tensorboard import SummaryWriter
def train_model(
    model,
    train_loader,
    val_loader,
    criterion,
    optimizer,
    epochs=50,
    device=device,
    verbose=True,
    log_to_tensorboard=True,
    config_name=None
):
    model.to(device)
    train_losses, val_losses = [], []
    # TensorBoard writer
    writer = None
    if log_to_tensorboard:
        tag = config_name or f"{model.__class__.__name__}_{int(time.time())}"
        writer = SummaryWriter(log_dir=f"runs/{tag}")
    for epoch in range(epochs):
        model.train()
        epoch_train_loss = 0.0
        for xb, yb in train_loader:
            xb, yb = xb.to(device), yb.to(device)
            optimizer.zero_grad()
            preds = model(xb)
            loss = criterion(preds, yb)
            loss.backward()
            optimizer.step()
            epoch_train_loss += loss.item()
        train_loss = epoch_train_loss / len(train_loader)
        train_losses.append(train_loss)
        # Validation
        model.eval()
        with torch.no_grad():
            val_loss = sum(
                criterion(model(xb.to(device)), yb.to(device)).item()
                for xb, yb in val_loader
            ) / len(val_loader)
        val_losses.append(val_loss)
        if verbose:
            print(f"Epoch {epoch+1}/{epochs} | Train Loss: {train_loss:.4f} | Val Loss: {val_loss:.4f}")
        if writer:
            writer.add_scalar("Loss/Train", train_loss, epoch)
            writer.add_scalar("Loss/Val", val_loss, epoch)
    if writer:
        writer.close()
    return model, train_losses, val_losses
II. Find the Best Model by Grid Search¶
1. Define a Hyperparameter Grid
import itertools
param_grid = {
    'hidden_size': [32, 64],
    'dropout': [0.2, 0.3],
    'lr': [1e-3, 5e-4],
    'seq_len': [30, 60],
    'num_layers': [2, 3]
}
def generate_configs(grid):
    keys = grid.keys()
    for values in itertools.product(*grid.values()):
        yield dict(zip(keys, values))
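Since each of the five hyperparameters has two candidate values, the grid expands to 2^5 = 32 configurations, matching the 32 runs logged in the search loop. A quick sanity check of the expansion:

```python
import itertools

param_grid = {
    'hidden_size': [32, 64],
    'dropout': [0.2, 0.3],
    'lr': [1e-3, 5e-4],
    'seq_len': [30, 60],
    'num_layers': [2, 3]
}

# itertools.product enumerates the Cartesian product of all value lists
configs = [dict(zip(param_grid.keys(), values))
           for values in itertools.product(*param_grid.values())]
print(len(configs))  # 32
```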
2. LSTM: Hyperparameter Search Loop
2.1. Evaluation function
def evaluate_model(model, data_loader, criterion=nn.MSELoss(), device=device, return_predictions=False):
    model.eval()
    model.to(device)
    preds, targets = [], []
    with torch.no_grad():
        for xb, yb in data_loader:
            xb, yb = xb.to(device), yb.to(device)
            pred = model(xb)
            preds.append(pred.cpu())
            targets.append(yb.cpu())
    preds = torch.cat(preds).squeeze()
    targets = torch.cat(targets).squeeze()
    mse = torch.mean((preds - targets) ** 2).item()
    rmse = np.sqrt(mse)
    mae = torch.mean(torch.abs(preds - targets)).item()
    if return_predictions:
        return mse, rmse, mae, preds.numpy(), targets.numpy()
    else:
        return mse, rmse, mae
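As a reminder of how the three metrics relate: RMSE is simply the square root of MSE (so it shares the target's units), while MAE penalizes all errors linearly and is therefore less sensitive to outliers. A toy NumPy example with made-up numbers:

```python
import numpy as np

# Illustrative predictions and targets (not from the model)
preds = np.array([0.1, -0.2, 0.4])
targets = np.array([0.0, -0.1, 0.2])

mse = np.mean((preds - targets) ** 2)   # mean of squared errors
rmse = np.sqrt(mse)                     # back in target units
mae = np.mean(np.abs(preds - targets))  # mean absolute error
```

Here the errors are 0.1, -0.1, 0.2, giving MSE = 0.02, RMSE ≈ 0.141, MAE ≈ 0.133.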
2.2. Training
results = []
for config in generate_configs(param_grid):
    print(f"Running config: {config}")
    # Prepare data based on seq_len
    X_train, y_train, X_val, y_val, X_test, y_test = prepare_data(
        df_scaled, feature_cols, seq_len=config['seq_len']
    )
    train_loader, val_loader = get_data_loaders(
        X_train, y_train, X_val, y_val, batch_size=32
    )
    # Model
    model = LSTMRegressor(
        input_size=X_train.shape[2],
        hidden_size=config['hidden_size'],
        dropout=config['dropout'],
        num_layers=config['num_layers']  # varied by the grid
    )
    # Optimizer, criterion
    optimizer = torch.optim.Adam(model.parameters(), lr=config['lr'])
    criterion = nn.MSELoss()
    # Train
    model, train_losses, val_losses = train_model(
        model, train_loader, val_loader,
        criterion=criterion,
        optimizer=optimizer,
        epochs=50,
        verbose=False,
        log_to_tensorboard=True,
        config_name=f"LSTM_h{config['hidden_size']}_nl{config['num_layers']}_sl{config['seq_len']}_lr{config['lr']}"
    )
    # Evaluate on validation set (the second returned loader is unused)
    val_loader_only, _ = get_data_loaders(X_val, y_val, X_val, y_val)
    _, rmse, mae = evaluate_model(model, val_loader_only)
    results.append((config, rmse, mae))
    print(f"RMSE: {rmse:.4f} | MAE: {mae:.4f}\n")
Running config: {'hidden_size': 32, 'dropout': 0.2, 'lr': 0.001, 'seq_len': 30, 'num_layers': 2}
RMSE: 1.2739 | MAE: 0.9764
Running config: {'hidden_size': 32, 'dropout': 0.2, 'lr': 0.001, 'seq_len': 30, 'num_layers': 3}
RMSE: 1.4027 | MAE: 1.0179
Running config: {'hidden_size': 32, 'dropout': 0.2, 'lr': 0.001, 'seq_len': 60, 'num_layers': 2}
RMSE: 1.3532 | MAE: 1.0162
Running config: {'hidden_size': 32, 'dropout': 0.2, 'lr': 0.001, 'seq_len': 60, 'num_layers': 3}
RMSE: 1.2236 | MAE: 0.9321
Running config: {'hidden_size': 32, 'dropout': 0.2, 'lr': 0.0005, 'seq_len': 30, 'num_layers': 2}
RMSE: 1.2990 | MAE: 1.0089
Running config: {'hidden_size': 32, 'dropout': 0.2, 'lr': 0.0005, 'seq_len': 30, 'num_layers': 3}
RMSE: 1.2495 | MAE: 0.9605
Running config: {'hidden_size': 32, 'dropout': 0.2, 'lr': 0.0005, 'seq_len': 60, 'num_layers': 2}
RMSE: 1.6032 | MAE: 1.1632
Running config: {'hidden_size': 32, 'dropout': 0.2, 'lr': 0.0005, 'seq_len': 60, 'num_layers': 3}
RMSE: 1.2177 | MAE: 0.9370
Running config: {'hidden_size': 32, 'dropout': 0.3, 'lr': 0.001, 'seq_len': 30, 'num_layers': 2}
RMSE: 1.4782 | MAE: 1.0958
Running config: {'hidden_size': 32, 'dropout': 0.3, 'lr': 0.001, 'seq_len': 30, 'num_layers': 3}
RMSE: 1.2438 | MAE: 0.9436
Running config: {'hidden_size': 32, 'dropout': 0.3, 'lr': 0.001, 'seq_len': 60, 'num_layers': 2}
RMSE: 1.3286 | MAE: 1.0067
Running config: {'hidden_size': 32, 'dropout': 0.3, 'lr': 0.001, 'seq_len': 60, 'num_layers': 3}
RMSE: 1.3198 | MAE: 0.9679
Running config: {'hidden_size': 32, 'dropout': 0.3, 'lr': 0.0005, 'seq_len': 30, 'num_layers': 2}
RMSE: 1.3366 | MAE: 1.0381
Running config: {'hidden_size': 32, 'dropout': 0.3, 'lr': 0.0005, 'seq_len': 30, 'num_layers': 3}
RMSE: 1.2445 | MAE: 0.9563
Running config: {'hidden_size': 32, 'dropout': 0.3, 'lr': 0.0005, 'seq_len': 60, 'num_layers': 2}
RMSE: 1.1990 | MAE: 0.9233
Running config: {'hidden_size': 32, 'dropout': 0.3, 'lr': 0.0005, 'seq_len': 60, 'num_layers': 3}
RMSE: 1.2087 | MAE: 0.9304
Running config: {'hidden_size': 64, 'dropout': 0.2, 'lr': 0.001, 'seq_len': 30, 'num_layers': 2}
RMSE: 1.3202 | MAE: 1.0232
Running config: {'hidden_size': 64, 'dropout': 0.2, 'lr': 0.001, 'seq_len': 30, 'num_layers': 3}
RMSE: 1.2567 | MAE: 0.9782
Running config: {'hidden_size': 64, 'dropout': 0.2, 'lr': 0.001, 'seq_len': 60, 'num_layers': 2}
RMSE: 1.2624 | MAE: 0.9647
Running config: {'hidden_size': 64, 'dropout': 0.2, 'lr': 0.001, 'seq_len': 60, 'num_layers': 3}
RMSE: 1.1957 | MAE: 0.9222
Running config: {'hidden_size': 64, 'dropout': 0.2, 'lr': 0.0005, 'seq_len': 30, 'num_layers': 2}
RMSE: 1.2610 | MAE: 0.9838
Running config: {'hidden_size': 64, 'dropout': 0.2, 'lr': 0.0005, 'seq_len': 30, 'num_layers': 3}
RMSE: 1.4576 | MAE: 1.0505
Running config: {'hidden_size': 64, 'dropout': 0.2, 'lr': 0.0005, 'seq_len': 60, 'num_layers': 2}
RMSE: 1.3276 | MAE: 0.9915
Running config: {'hidden_size': 64, 'dropout': 0.2, 'lr': 0.0005, 'seq_len': 60, 'num_layers': 3}
RMSE: 1.2283 | MAE: 0.9391
Running config: {'hidden_size': 64, 'dropout': 0.3, 'lr': 0.001, 'seq_len': 30, 'num_layers': 2}
RMSE: 1.2612 | MAE: 0.9836
Running config: {'hidden_size': 64, 'dropout': 0.3, 'lr': 0.001, 'seq_len': 30, 'num_layers': 3}
RMSE: 1.2422 | MAE: 0.9523
Running config: {'hidden_size': 64, 'dropout': 0.3, 'lr': 0.001, 'seq_len': 60, 'num_layers': 2}
RMSE: 1.4392 | MAE: 1.0741
Running config: {'hidden_size': 64, 'dropout': 0.3, 'lr': 0.001, 'seq_len': 60, 'num_layers': 3}
RMSE: 1.2831 | MAE: 0.9619
Running config: {'hidden_size': 64, 'dropout': 0.3, 'lr': 0.0005, 'seq_len': 30, 'num_layers': 2}
RMSE: 1.3363 | MAE: 1.0200
Running config: {'hidden_size': 64, 'dropout': 0.3, 'lr': 0.0005, 'seq_len': 30, 'num_layers': 3}
RMSE: 1.4134 | MAE: 1.0198
Running config: {'hidden_size': 64, 'dropout': 0.3, 'lr': 0.0005, 'seq_len': 60, 'num_layers': 2}
RMSE: 1.3152 | MAE: 0.9970
Running config: {'hidden_size': 64, 'dropout': 0.3, 'lr': 0.0005, 'seq_len': 60, 'num_layers': 3}
RMSE: 1.2753 | MAE: 0.9631
We performed a grid search over 32 LSTM configurations, varying hidden size, dropout, learning rate, sequence length, and the number of layers. This systematically evaluated performance across setups and identified the best-performing model by RMSE and MAE. This tuning step was crucial for ensuring the LSTM baseline was both competitive and well-calibrated before comparing it with the GRU and Transformer architectures.
2.3. TensorBoard Visualization
%load_ext tensorboard
%tensorboard --logdir=runs
2.4 Rank Top Configs (on Validation RMSE)
results.sort(key=lambda x: x[1])
print("Top 5 LSTM Configurations (by Validation RMSE):\n")
for i, (config, rmse, mae) in enumerate(results[:5]):
    print(f"{i+1}. {config} | RMSE: {rmse:.4f} | MAE: {mae:.4f}")
Top 5 LSTM Configurations (by Validation RMSE):
1. {'hidden_size': 64, 'dropout': 0.2, 'lr': 0.001, 'seq_len': 60, 'num_layers': 3} | RMSE: 1.1957 | MAE: 0.9222
2. {'hidden_size': 32, 'dropout': 0.3, 'lr': 0.0005, 'seq_len': 60, 'num_layers': 2} | RMSE: 1.1990 | MAE: 0.9233
3. {'hidden_size': 32, 'dropout': 0.3, 'lr': 0.0005, 'seq_len': 60, 'num_layers': 3} | RMSE: 1.2087 | MAE: 0.9304
4. {'hidden_size': 32, 'dropout': 0.2, 'lr': 0.0005, 'seq_len': 60, 'num_layers': 3} | RMSE: 1.2177 | MAE: 0.9370
5. {'hidden_size': 32, 'dropout': 0.2, 'lr': 0.001, 'seq_len': 60, 'num_layers': 3} | RMSE: 1.2236 | MAE: 0.9321
III. Evaluate Top Configs on Test Set¶
best_config = results[0][0]  # Top config
# Re-prepare data with the correct seq_len
X_train, y_train, X_val, y_val, X_test, y_test = prepare_data(
    df_scaled, feature_cols, seq_len=best_config['seq_len']
)
train_loader, val_loader = get_data_loaders(X_train, y_train, X_val, y_val, batch_size=32)
test_loader, _ = get_data_loaders(X_test, y_test, X_test, y_test, batch_size=32)
model = LSTMRegressor(
    input_size=X_train.shape[2],
    hidden_size=best_config['hidden_size'],
    dropout=best_config['dropout'],
    num_layers=best_config['num_layers']
)
optimizer = torch.optim.Adam(model.parameters(), lr=best_config['lr'])
criterion = nn.MSELoss()
# Retrain from scratch with the best config (on the training set, validating on the val set)
model, _, _ = train_model(model, train_loader, val_loader, criterion, optimizer, epochs=50, verbose=False)
# Evaluate on test
mse, rmse, mae = evaluate_model(model, test_loader)
print(f"\nFinal Test Results — Best LSTM Config:")
print(f"LSTM Test MSE : {mse:.4f}")
print(f"LSTM Test RMSE: {rmse:.4f}")
print(f"LSTM Test MAE : {mae:.4f}")
Final Test Results — Best LSTM Config:
LSTM Test MSE : 0.4784
LSTM Test RMSE: 0.6917
LSTM Test MAE : 0.5207
After a grid search over 32 LSTM configurations, we identified a best-performing model with significantly improved results. Compared to the baseline LSTM, which achieved a test RMSE of 0.7529 and MAE of 0.5677, the tuned LSTM reduced the RMSE to 0.6917 and the MAE to 0.5207. This improvement demonstrates the value of systematic hyperparameter optimization in enhancing model accuracy for financial time series forecasting.
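The size of the improvement can be expressed in relative terms using the metric values reported above:

```python
# Test metrics reported for the baseline and tuned LSTM
baseline_rmse, tuned_rmse = 0.7529, 0.6917
baseline_mae, tuned_mae = 0.5677, 0.5207

rmse_gain = 100 * (baseline_rmse - tuned_rmse) / baseline_rmse
mae_gain = 100 * (baseline_mae - tuned_mae) / baseline_mae
print(f"RMSE reduced by {rmse_gain:.1f}%, MAE by {mae_gain:.1f}%")
# RMSE reduced by 8.1%, MAE by 8.3%
```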
IV. Saving the experiment¶
import os
import torch
import json
import pickle
import numpy as np
import pandas as pd
def save_experiment(
    model,                # trained model
    config,               # best_config dict
    train_losses=None,
    val_losses=None,
    y_true=None,
    y_pred=None,
    output_dir="experiment_lstm_tuned",
    model_filename="project_weights_lstm_tuned.pt"
):
    os.makedirs(output_dir, exist_ok=True)
    # Save model weights
    model_path = os.path.join(output_dir, model_filename)
    torch.save(model.state_dict(), model_path)
    # Save config
    config_path = os.path.join(output_dir, "best_config.json")
    with open(config_path, "w") as f:
        json.dump(config, f, indent=4)
    # Save training history
    if train_losses is not None and val_losses is not None:
        history_path = os.path.join(output_dir, "training_history.pkl")
        with open(history_path, "wb") as f:
            pickle.dump({"train_losses": train_losses, "val_losses": val_losses}, f)
    # Save predictions
    if y_true is not None and y_pred is not None:
        df_preds = pd.DataFrame({
            "Actual": np.array(y_true),
            "Predicted": np.array(y_pred)
        })
        df_preds.to_csv(os.path.join(output_dir, "test_predictions.csv"), index=False)
    print(f"Experiment saved to: {output_dir}")
Saving the experiment
save_experiment(
    model=model,
    config=best_config,
    train_losses=train_losses,
    val_losses=val_losses,
    y_true=y_true,
    y_pred=y_pred_lstm,
    output_dir="experiment_lstm_tuned"
)
V. Plots¶
1. Train vs Validation Loss Curve
import pickle
import matplotlib.pyplot as plt
# Load training history
with open("experiment_lstm_tuned/training_history.pkl", "rb") as f:
    history = pickle.load(f)
train_losses = history["train_losses"]
val_losses = history["val_losses"]
# Plot
plt.figure(figsize=(8, 4))
plt.plot(train_losses, label="Train Loss")
plt.plot(val_losses, label="Val Loss")
plt.xlabel("Epoch")
plt.ylabel("MSE Loss")
plt.title("LSTM: Training vs Validation Loss")
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
The plot shows a steady decline in training loss, while the validation loss remains relatively flat with high variance. This suggests that while the model is learning on the training data, it may be struggling to generalize, potentially due to overfitting or high variance in the validation set. Techniques like early stopping or regularization could help stabilize performance further.
2. Predicted vs Actual (Line Plot)
import pandas as pd
import matplotlib.pyplot as plt
# Load LSTM predictions
df_preds_lstm = pd.read_csv("experiment_lstm_tuned/test_predictions.csv")
# Plot
plt.figure(figsize=(10, 4))
plt.plot(df_preds_lstm["Actual"], label="Actual", alpha=0.7)
plt.plot(df_preds_lstm["Predicted"], label="Predicted", alpha=0.7)
plt.title("LSTM: Predicted vs Actual Log Returns")
plt.xlabel("Time Step")
plt.ylabel("Log Return")
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
The plot shows that while the predicted values capture the overall trend and central tendency of the actual returns, they tend to smooth out the extreme fluctuations. This is typical in regression-based models, where the focus is on minimizing average error rather than capturing rare, high-volatility events.
3. Scatter Plot: Actual vs Predicted
plt.figure(figsize=(6, 6))
plt.scatter(df_preds_lstm["Actual"], df_preds_lstm["Predicted"], alpha=0.5, color='steelblue')
plt.plot([-2, 2], [-2, 2], color='gray', linestyle='--') # Identity line
plt.title("LSTM: Actual vs Predicted Scatter")
plt.xlabel("Actual Log Return")
plt.ylabel("Predicted Log Return")
plt.grid(True)
plt.axis("equal")
plt.tight_layout()
plt.show()
While the predictions are generally centered around zero and follow the correct trend, they cluster tightly, indicating the model tends to underestimate extreme movements. The scatter around the diagonal suggests reasonable correlation, but limited responsiveness to higher volatility, a common challenge in financial return modeling.
I. Find the Best Model by Grid Search¶
1. Define a Hyperparameter Grid
param_grid_gru = {
    'hidden_size': [32, 64],
    'dropout': [0.2, 0.3],
    'lr': [1e-3, 5e-4],
    'seq_len': [30, 60],
    'num_layers': [2, 3]
}
2. GRU: Hyperparameter Search Loop
results_gru = []
for config in generate_configs(param_grid_gru):
    print(f"Running GRU config: {config}")
    # Prepare data with the given sequence length
    X_train, y_train, X_val, y_val, X_test, y_test = prepare_data(
        df_scaled, feature_cols, seq_len=config['seq_len']
    )
    train_loader, val_loader = get_data_loaders(X_train, y_train, X_val, y_val, batch_size=32)
    # Instantiate GRU model
    model = GRURegressor(
        input_size=X_train.shape[2],
        hidden_size=config['hidden_size'],
        dropout=config['dropout'],
        num_layers=config['num_layers']
    )
    optimizer = torch.optim.Adam(model.parameters(), lr=config['lr'])
    criterion = nn.MSELoss()
    # Train GRU model
    model, train_losses, val_losses = train_model(
        model, train_loader, val_loader,
        criterion=criterion,
        optimizer=optimizer,
        epochs=50,
        verbose=False,
        log_to_tensorboard=True,
        config_name=f"GRU_h{config['hidden_size']}_nl{config['num_layers']}_sl{config['seq_len']}_lr{config['lr']}"
    )
    # Evaluate on validation set
    val_loader_only, _ = get_data_loaders(X_val, y_val, X_val, y_val)
    _, rmse, mae = evaluate_model(model, val_loader_only)
    results_gru.append((config, rmse, mae))
    print(f"GRU RMSE: {rmse:.4f} | MAE: {mae:.4f}\n")
Running GRU config: {'hidden_size': 32, 'dropout': 0.2, 'lr': 0.001, 'seq_len': 30, 'num_layers': 2}
GRU RMSE: 1.3826 | MAE: 1.0499
Running GRU config: {'hidden_size': 32, 'dropout': 0.2, 'lr': 0.001, 'seq_len': 30, 'num_layers': 3}
GRU RMSE: 1.2991 | MAE: 1.0041
Running GRU config: {'hidden_size': 32, 'dropout': 0.2, 'lr': 0.001, 'seq_len': 60, 'num_layers': 2}
GRU RMSE: 1.5463 | MAE: 1.1125
Running GRU config: {'hidden_size': 32, 'dropout': 0.2, 'lr': 0.001, 'seq_len': 60, 'num_layers': 3}
GRU RMSE: 1.2201 | MAE: 0.9406
Running GRU config: {'hidden_size': 32, 'dropout': 0.2, 'lr': 0.0005, 'seq_len': 30, 'num_layers': 2}
GRU RMSE: 1.2816 | MAE: 0.9862
Running GRU config: {'hidden_size': 32, 'dropout': 0.2, 'lr': 0.0005, 'seq_len': 30, 'num_layers': 3}
GRU RMSE: 1.6661 | MAE: 1.1922
Running GRU config: {'hidden_size': 32, 'dropout': 0.2, 'lr': 0.0005, 'seq_len': 60, 'num_layers': 2}
GRU RMSE: 1.7062 | MAE: 1.2154
Running GRU config: {'hidden_size': 32, 'dropout': 0.2, 'lr': 0.0005, 'seq_len': 60, 'num_layers': 3}
GRU RMSE: 1.2622 | MAE: 0.9552
Running GRU config: {'hidden_size': 32, 'dropout': 0.3, 'lr': 0.001, 'seq_len': 30, 'num_layers': 2}
GRU RMSE: 1.3020 | MAE: 0.9947
Running GRU config: {'hidden_size': 32, 'dropout': 0.3, 'lr': 0.001, 'seq_len': 30, 'num_layers': 3}
GRU RMSE: 1.3118 | MAE: 0.9860
Running GRU config: {'hidden_size': 32, 'dropout': 0.3, 'lr': 0.001, 'seq_len': 60, 'num_layers': 2}
GRU RMSE: 1.4116 | MAE: 1.0473
Running GRU config: {'hidden_size': 32, 'dropout': 0.3, 'lr': 0.001, 'seq_len': 60, 'num_layers': 3}
GRU RMSE: 1.2075 | MAE: 0.9313
Running GRU config: {'hidden_size': 32, 'dropout': 0.3, 'lr': 0.0005, 'seq_len': 30, 'num_layers': 2}
GRU RMSE: 1.3214 | MAE: 1.0105
Running GRU config: {'hidden_size': 32, 'dropout': 0.3, 'lr': 0.0005, 'seq_len': 30, 'num_layers': 3}
GRU RMSE: 1.2536 | MAE: 0.9644
Running GRU config: {'hidden_size': 32, 'dropout': 0.3, 'lr': 0.0005, 'seq_len': 60, 'num_layers': 2}
GRU RMSE: 1.3265 | MAE: 1.0065
Running GRU config: {'hidden_size': 32, 'dropout': 0.3, 'lr': 0.0005, 'seq_len': 60, 'num_layers': 3}
GRU RMSE: 1.2024 | MAE: 0.9287
Running GRU config: {'hidden_size': 64, 'dropout': 0.2, 'lr': 0.001, 'seq_len': 30, 'num_layers': 2}
GRU RMSE: 1.5041 | MAE: 1.1170
Running GRU config: {'hidden_size': 64, 'dropout': 0.2, 'lr': 0.001, 'seq_len': 30, 'num_layers': 3}
GRU RMSE: 1.2533 | MAE: 0.9717
Running GRU config: {'hidden_size': 64, 'dropout': 0.2, 'lr': 0.001, 'seq_len': 60, 'num_layers': 2}
GRU RMSE: 1.2930 | MAE: 0.9761
Running GRU config: {'hidden_size': 64, 'dropout': 0.2, 'lr': 0.001, 'seq_len': 60, 'num_layers': 3}
GRU RMSE: 1.4640 | MAE: 1.0619
Running GRU config: {'hidden_size': 64, 'dropout': 0.2, 'lr': 0.0005, 'seq_len': 30, 'num_layers': 2}
GRU RMSE: 1.3986 | MAE: 1.0541
Running GRU config: {'hidden_size': 64, 'dropout': 0.2, 'lr': 0.0005, 'seq_len': 30, 'num_layers': 3}
GRU RMSE: 1.3829 | MAE: 1.0449
Running GRU config: {'hidden_size': 64, 'dropout': 0.2, 'lr': 0.0005, 'seq_len': 60, 'num_layers': 2}
GRU RMSE: 1.4323 | MAE: 1.0535
Running GRU config: {'hidden_size': 64, 'dropout': 0.2, 'lr': 0.0005, 'seq_len': 60, 'num_layers': 3}
GRU RMSE: 1.4907 | MAE: 1.0778
Running GRU config: {'hidden_size': 64, 'dropout': 0.3, 'lr': 0.001, 'seq_len': 30, 'num_layers': 2}
GRU RMSE: 1.2652 | MAE: 0.9760
Running GRU config: {'hidden_size': 64, 'dropout': 0.3, 'lr': 0.001, 'seq_len': 30, 'num_layers': 3}
GRU RMSE: 1.2893 | MAE: 0.9983
Running GRU config: {'hidden_size': 64, 'dropout': 0.3, 'lr': 0.001, 'seq_len': 60, 'num_layers': 2}
GRU RMSE: 1.3142 | MAE: 0.9940
Running GRU config: {'hidden_size': 64, 'dropout': 0.3, 'lr': 0.001, 'seq_len': 60, 'num_layers': 3}
GRU RMSE: 1.3776 | MAE: 1.0213
Running GRU config: {'hidden_size': 64, 'dropout': 0.3, 'lr': 0.0005, 'seq_len': 30, 'num_layers': 2}
GRU RMSE: 1.3987 | MAE: 1.0555
Running GRU config: {'hidden_size': 64, 'dropout': 0.3, 'lr': 0.0005, 'seq_len': 30, 'num_layers': 3}
GRU RMSE: 1.2457 | MAE: 0.9615
Running GRU config: {'hidden_size': 64, 'dropout': 0.3, 'lr': 0.0005, 'seq_len': 60, 'num_layers': 2}
GRU RMSE: 1.3005 | MAE: 0.9880
Running GRU config: {'hidden_size': 64, 'dropout': 0.3, 'lr': 0.0005, 'seq_len': 60, 'num_layers': 3}
GRU RMSE: 1.4227 | MAE: 1.0452
As with the LSTM, we performed an exhaustive grid search to identify the best GRU configuration across 32 hyperparameter combinations. The tuning process explored variations in hidden size, dropout, learning rate, sequence length, and number of layers.
2.3. TensorBoard Visualization
%reload_ext tensorboard
%tensorboard --logdir=runs
2.4 Rank Top Configs (on Validation RMSE)
results_gru.sort(key=lambda x: x[1])
for i, (config, rmse, mae) in enumerate(results_gru[:5]):
    print(f"{i+1}. {config} | RMSE: {rmse:.4f} | MAE: {mae:.4f}")
1. {'hidden_size': 32, 'dropout': 0.3, 'lr': 0.0005, 'seq_len': 60, 'num_layers': 3} | RMSE: 1.2024 | MAE: 0.9287
2. {'hidden_size': 32, 'dropout': 0.3, 'lr': 0.001, 'seq_len': 60, 'num_layers': 3} | RMSE: 1.2075 | MAE: 0.9313
3. {'hidden_size': 32, 'dropout': 0.2, 'lr': 0.001, 'seq_len': 60, 'num_layers': 3} | RMSE: 1.2201 | MAE: 0.9406
4. {'hidden_size': 64, 'dropout': 0.3, 'lr': 0.0005, 'seq_len': 30, 'num_layers': 3} | RMSE: 1.2457 | MAE: 0.9615
5. {'hidden_size': 64, 'dropout': 0.2, 'lr': 0.001, 'seq_len': 30, 'num_layers': 3} | RMSE: 1.2533 | MAE: 0.9717
III. Evaluate Top Configs on Test Set¶
# Pick best config
best_gru_config = results_gru[0][0]  # Top config
# Prepare data using best seq_len
X_train, y_train, X_val, y_val, X_test, y_test = prepare_data(
    df_scaled, feature_cols, seq_len=best_gru_config['seq_len']
)
train_loader, val_loader = get_data_loaders(X_train, y_train, X_val, y_val, batch_size=32)
test_loader, _ = get_data_loaders(X_test, y_test, X_test, y_test, batch_size=32)
# Build GRU model
model_gru = GRURegressor(
    input_size=X_train.shape[2],
    hidden_size=best_gru_config['hidden_size'],
    dropout=best_gru_config['dropout'],
    num_layers=best_gru_config['num_layers']
)
# Optimizer and criterion
optimizer = torch.optim.Adam(model_gru.parameters(), lr=best_gru_config['lr'])
criterion = nn.MSELoss()
# Retrain with the best config (on the training set, validating on the val set)
model_gru, gru_train_losses, gru_val_losses = train_model(
    model_gru, train_loader, val_loader,
    criterion=criterion, optimizer=optimizer,
    epochs=50, verbose=False
)
# Evaluate on test
mse, rmse, mae = evaluate_model(model_gru, test_loader)
print(f"\nFinal Test Results — Best GRU Config:")
print(f"GRU Test MSE : {mse:.4f}")
print(f"GRU Test RMSE: {rmse:.4f}")
print(f"GRU Test MAE : {mae:.4f}")
Final Test Results — Best GRU Config:
GRU Test MSE : 0.4774
GRU Test RMSE: 0.6909
GRU Test MAE : 0.5197
The baseline GRU model yielded a test RMSE of 0.8241 and MAE of 0.6231. After performing grid search across 32 hyperparameter combinations, the best GRU configuration significantly improved performance, reducing the RMSE to 0.6909 and MAE to 0.5197. This result highlights the impact of systematic tuning in enhancing the GRU model’s ability to capture temporal patterns in financial log returns.
IV. Saving the experiment¶
import os
import torch
import json
import pickle
import numpy as np
import pandas as pd
def save_experiment_gru(
    model,                # trained GRU model
    config,               # best_config dict
    train_losses=None,
    val_losses=None,
    y_true=None,
    y_pred=None,
    output_dir="experiment_gru_tuned",
    model_filename="project_weights_gru_tuned.pt"
):
    os.makedirs(output_dir, exist_ok=True)
    # Save model weights
    model_path = os.path.join(output_dir, model_filename)
    torch.save(model.state_dict(), model_path)
    # Save config
    config_path = os.path.join(output_dir, "best_config.json")
    with open(config_path, "w") as f:
        json.dump(config, f, indent=4)
    # Save training history
    if train_losses is not None and val_losses is not None:
        history_path = os.path.join(output_dir, "training_history.pkl")
        with open(history_path, "wb") as f:
            pickle.dump({"train_losses": train_losses, "val_losses": val_losses}, f)
    # Save predictions
    if y_true is not None and y_pred is not None:
        df_preds = pd.DataFrame({
            "Actual": np.array(y_true),
            "Predicted": np.array(y_pred)
        })
        df_preds.to_csv(os.path.join(output_dir, "test_predictions.csv"), index=False)
    print(f"GRU experiment saved to: {output_dir}")
Saving the experiment
# Predict on test set
model_gru.eval()
with torch.no_grad():
    y_pred_tensor = model_gru(torch.tensor(X_test, dtype=torch.float32).to(device)).squeeze()
y_true = y_test  # already a NumPy array
y_pred_gru = y_pred_tensor.cpu().numpy()
# Save everything
save_experiment_gru(
    model=model_gru,
    config=best_gru_config,
    train_losses=gru_train_losses,
    val_losses=gru_val_losses,
    y_true=y_true,
    y_pred=y_pred_gru,
    output_dir="experiment_gru_tuned",
    model_filename="project_weights_gru_tuned.pt"
)
V. Plots¶
- Training vs Validation Loss Curve
import matplotlib.pyplot as plt
import pickle
# Load training history
with open("experiment_gru_tuned/training_history.pkl", "rb") as f:
    history = pickle.load(f)
gru_train_losses = history["train_losses"]
gru_val_losses = history["val_losses"]
# Plot
plt.figure(figsize=(8, 4))
plt.plot(gru_train_losses, label="Train Loss")
plt.plot(gru_val_losses, label="Val Loss")
plt.xlabel("Epoch")
plt.ylabel("MSE Loss")
plt.title("GRU: Training vs Validation Loss")
plt.legend()
plt.grid(True)
plt.show()
The training loss shows a gradual decline, indicating the model is learning effectively on the training data. However, the validation loss exhibits noticeable variance and instability, suggesting potential overfitting or high sensitivity to noise in the validation set. This could potentially be improved with techniques like early stopping, regularization, or more robust validation splits.
- Predicted vs Actual (Line Plot)
import pandas as pd
# Load predictions
df_preds = pd.read_csv("experiment_gru_tuned/test_predictions.csv")
# Plot
plt.figure(figsize=(10, 4))
plt.plot(df_preds["Actual"], label="Actual", alpha=0.7)
plt.plot(df_preds["Predicted"], label="Predicted", alpha=0.7)
plt.title("GRU: Predicted vs Actual Log Returns")
plt.xlabel("Time Step")
plt.ylabel("Log Return")
plt.legend()
plt.grid(True)
plt.show()
While the GRU model captures the overall direction and trend centrality of the series, it underestimates the magnitude of rapid changes and high-volatility spikes. This results in smoother predicted values that track the general trend but miss sharp fluctuations, a common tradeoff in deep learning models trained with MSE-based objectives on noisy financial data.
- Scatter Plot: Actual vs Predicted
plt.figure(figsize=(6, 6))
plt.scatter(df_preds["Actual"], df_preds["Predicted"], alpha=0.5, color='orange')
plt.plot([-2, 2], [-2, 2], color='gray', linestyle='--')  # Identity line
plt.title("GRU: Actual vs Predicted Scatter")
plt.xlabel("Actual Log Return")
plt.ylabel("Predicted Log Return")
plt.grid(True)
# Fixing both axis limits on a square figure keeps the aspect equal;
# combining axis("equal") with explicit limits triggers a matplotlib warning.
plt.xlim(-2, 2)
plt.ylim(-2, 2)
plt.show()
The predictions are tightly clustered around zero, indicating the GRU model tends to regress toward the mean and underestimates the magnitude of larger log returns. This conservative behavior is common in financial models trained to minimize MSE, especially when the data is noisy and exhibits heavy-tailed distributions.
Transformer¶
Transformer (Vanilla)¶
I. Define the Vanilla Transformer¶
import torch
import torch.nn as nn
import math
# Positional encoding module for adding temporal information to input embeddings
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        # Create a matrix of shape (max_len, d_model) for positional encodings
        pe = torch.zeros(max_len, d_model)
        # Generate position indices (0 to max_len - 1) as a column vector
        position = torch.arange(0, max_len).unsqueeze(1)
        # Compute the denominator term for sine/cosine frequencies
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        # Apply sine to even indices in the embedding dimension
        pe[:, 0::2] = torch.sin(position * div_term)
        # Apply cosine to odd indices in the embedding dimension
        pe[:, 1::2] = torch.cos(position * div_term)
        self.pe = pe.unsqueeze(0)  # Add a batch dimension (1, max_len, d_model)

    def forward(self, x):
        # Add positional encodings to the input tensor
        x = x + self.pe[:, :x.size(1)].to(x.device)  # x: (batch_size, seq_len, d_model)
        return x
class TimeSeriesTransformer(nn.Module):
    def __init__(self, input_dim, model_dim=64, num_heads=4, num_layers=2, dropout=0.1):
        super().__init__()
        # Project raw input features into model dimension space
        self.input_proj = nn.Linear(input_dim, model_dim)
        # Add positional encoding to the projected inputs
        self.pos_encoder = PositionalEncoding(model_dim)
        # Define a single Transformer encoder layer
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=model_dim,
            nhead=num_heads,
            dim_feedforward=128,
            dropout=dropout,
            batch_first=True  # Enable (batch, seq, feature) input format
        )
        # Stack multiple encoder layers to form the full Transformer encoder
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        # Output head: MLP to map the final hidden state to a scalar prediction
        self.head = nn.Sequential(
            nn.Linear(model_dim, 32),
            nn.ReLU(),
            nn.Linear(32, 1)
        )

    def forward(self, x):
        x = self.input_proj(x)   # Project input to model dimension
        x = self.pos_encoder(x)  # Add positional encoding
        x = self.transformer(x)  # Pass through Transformer encoder
        out = x[:, -1, :]        # Use the last token's output as the sequence representation
        return self.head(out)    # Predict the next value with the MLP head
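The sinusoidal scheme used in `PositionalEncoding` can be sanity-checked independently of PyTorch; a NumPy re-implementation of the same formula confirms the expected properties (position 0 encodes to alternating zeros and ones, and every entry stays within [-1, 1]):

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    # Same formula as the PyTorch module, in NumPy
    position = np.arange(max_len)[:, None]
    div_term = np.exp(np.arange(0, d_model, 2) * (-np.log(10000.0) / d_model))
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(position * div_term)  # even dims: sine
    pe[:, 1::2] = np.cos(position * div_term)  # odd dims: cosine
    return pe

pe = sinusoidal_pe(100, 64)
print(pe[0, :4])  # [0. 1. 0. 1.]  (sin(0)=0, cos(0)=1)
```

Because each dimension oscillates at a different frequency, every position gets a distinct fingerprint, which is what lets the otherwise order-agnostic attention layers distinguish time steps.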
II. Training & Evaluation Setup¶
import time
from torch.utils.tensorboard import SummaryWriter
# Training loop for a PyTorch model with TensorBoard logging
def train_model(
    model,
    train_loader,
    val_loader,
    criterion,
    optimizer,
    epochs=50,
    device=device,
    log_to_tensorboard=True,
    config_name="transformer_default",
    verbose=True
):
    model.to(device)
    train_losses, val_losses = [], []
    # Setup TensorBoard writer
    writer = SummaryWriter(log_dir=f"runs/{config_name}") if log_to_tensorboard else None
    for epoch in range(epochs):
        model.train()
        train_loss = 0.0
        # Training step
        for xb, yb in train_loader:
            xb, yb = xb.to(device), yb.to(device)
            optimizer.zero_grad()
            preds = model(xb)
            loss = criterion(preds, yb)
            loss.backward()
            optimizer.step()
            train_loss += loss.item()
        train_loss /= len(train_loader)
        train_losses.append(train_loss)
        # Validation step
        model.eval()
        val_loss = 0.0
        with torch.no_grad():
            for xb, yb in val_loader:
                xb, yb = xb.to(device), yb.to(device)
                preds = model(xb)
                loss = criterion(preds, yb)
                val_loss += loss.item()
        val_loss /= len(val_loader)
        val_losses.append(val_loss)
        # Logging and output
        if verbose:
            print(f"Epoch {epoch+1}/{epochs} | Train Loss: {train_loss:.4f} | Val Loss: {val_loss:.4f}")
        if writer:
            writer.add_scalar("Loss/Train", train_loss, epoch)
            writer.add_scalar("Loss/Val", val_loss, epoch)
    # Close TensorBoard writer
    if writer:
        writer.close()
    return model, train_losses, val_losses
III. Define evaluate_model() with Optional Predictions¶
import numpy as np
# Evaluation function to compute MSE, RMSE, MAE (with predictions)
def evaluate_model(model, data_loader, criterion=nn.MSELoss(), device=device, return_predictions=False):
model.eval()
model.to(device)
preds, targets = [], []
# Inference loop (no gradients)
with torch.no_grad():
for xb, yb in data_loader:
xb, yb = xb.to(device), yb.to(device)
pred = model(xb)
preds.append(pred.cpu())
targets.append(yb.cpu())
# Concatenate all predictions and targets
preds = torch.cat(preds).squeeze()
targets = torch.cat(targets).squeeze()
# Compute evaluation metrics
mse = torch.mean((preds - targets) ** 2).item()
rmse = np.sqrt(mse)
mae = torch.mean(torch.abs(preds - targets)).item()
# return raw predictions and targets
if return_predictions:
return mse, rmse, mae, preds.numpy(), targets.numpy()
return mse, rmse, mae
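On hypothetical toy values, the three metrics computed above reduce to a few lines of NumPy:

```python
import numpy as np

preds = np.array([0.10, -0.20, 0.05])    # made-up predictions
targets = np.array([0.00, -0.10, 0.10])  # made-up targets

mse = np.mean((preds - targets) ** 2)    # mean squared error: 0.0075
rmse = np.sqrt(mse)                      # root of MSE, same units as the target
mae = np.mean(np.abs(preds - targets))   # mean absolute error: 0.25 / 3
```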
IV. Run the Vanilla Transformer Training¶
# Data
seq_len = 60
X_train, y_train, X_val, y_val, X_test, y_test = prepare_data(
df_scaled, feature_cols, seq_len=seq_len
)
train_loader, val_loader = get_data_loaders(X_train, y_train, X_val, y_val, batch_size=32)
test_loader, _ = get_data_loaders(X_test, y_test, X_test, y_test, batch_size=32)
# Model
model = TimeSeriesTransformer(
input_dim=input_size,
model_dim=64,
num_heads=4,
num_layers=2,
dropout=0.1
)
# Optimizer and Loss
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()
# Train
model, train_losses, val_losses = train_model(
model, train_loader, val_loader,
criterion=criterion,
optimizer=optimizer,
epochs=50,
log_to_tensorboard=True,
config_name="transformer_baseline"
)
# Evaluate
mse, rmse, mae, y_pred, y_true = evaluate_model(model, test_loader, return_predictions=True)
print(f"\n[Transformer] Test MSE: {mse:.4f} | RMSE: {rmse:.4f} | MAE: {mae:.4f}")
Epoch 1/50 | Train Loss: 1.0560 | Val Loss: 1.4010
Epoch 2/50 | Train Loss: 1.0423 | Val Loss: 1.4353
Epoch 3/50 | Train Loss: 1.0285 | Val Loss: 1.4041
Epoch 4/50 | Train Loss: 1.0015 | Val Loss: 1.8611
Epoch 5/50 | Train Loss: 1.0466 | Val Loss: 1.3770
Epoch 6/50 | Train Loss: 1.0287 | Val Loss: 1.4237
Epoch 7/50 | Train Loss: 1.0182 | Val Loss: 1.5937
Epoch 8/50 | Train Loss: 1.0014 | Val Loss: 1.4345
Epoch 9/50 | Train Loss: 1.0035 | Val Loss: 1.4745
Epoch 10/50 | Train Loss: 0.9816 | Val Loss: 2.2235
Epoch 11/50 | Train Loss: 1.2082 | Val Loss: 1.5907
Epoch 12/50 | Train Loss: 1.0278 | Val Loss: 1.3937
Epoch 13/50 | Train Loss: 1.0224 | Val Loss: 1.4201
Epoch 14/50 | Train Loss: 0.9977 | Val Loss: 1.5446
Epoch 15/50 | Train Loss: 1.0044 | Val Loss: 1.5680
Epoch 16/50 | Train Loss: 0.9909 | Val Loss: 2.3362
Epoch 17/50 | Train Loss: 0.9856 | Val Loss: 1.4531
Epoch 18/50 | Train Loss: 0.9765 | Val Loss: 2.3087
Epoch 19/50 | Train Loss: 1.0040 | Val Loss: 1.3811
Epoch 20/50 | Train Loss: 1.0108 | Val Loss: 1.3955
Epoch 21/50 | Train Loss: 0.9922 | Val Loss: 1.6466
Epoch 22/50 | Train Loss: 0.9924 | Val Loss: 1.3715
Epoch 23/50 | Train Loss: 0.9860 | Val Loss: 1.4310
Epoch 24/50 | Train Loss: 0.9465 | Val Loss: 1.3904
Epoch 25/50 | Train Loss: 1.0154 | Val Loss: 1.5313
Epoch 26/50 | Train Loss: 1.0188 | Val Loss: 1.4463
Epoch 27/50 | Train Loss: 1.0072 | Val Loss: 1.3698
Epoch 28/50 | Train Loss: 1.0430 | Val Loss: 1.3724
Epoch 29/50 | Train Loss: 1.0412 | Val Loss: 1.3713
Epoch 30/50 | Train Loss: 1.0392 | Val Loss: 1.3764
Epoch 31/50 | Train Loss: 1.0349 | Val Loss: 1.3832
Epoch 32/50 | Train Loss: 1.0198 | Val Loss: 1.4785
Epoch 33/50 | Train Loss: 1.0165 | Val Loss: 1.3724
Epoch 34/50 | Train Loss: 1.0106 | Val Loss: 1.3792
Epoch 35/50 | Train Loss: 1.0008 | Val Loss: 1.4077
Epoch 36/50 | Train Loss: 0.9362 | Val Loss: 1.4173
Epoch 37/50 | Train Loss: 0.9250 | Val Loss: 1.3831
Epoch 38/50 | Train Loss: 0.9485 | Val Loss: 1.4550
Epoch 39/50 | Train Loss: 1.0614 | Val Loss: 1.3693
Epoch 40/50 | Train Loss: 1.0400 | Val Loss: 1.3721
Epoch 41/50 | Train Loss: 1.0382 | Val Loss: 1.3725
Epoch 42/50 | Train Loss: 1.0391 | Val Loss: 1.3731
Epoch 43/50 | Train Loss: 1.0357 | Val Loss: 1.3729
Epoch 44/50 | Train Loss: 1.0270 | Val Loss: 1.3948
Epoch 45/50 | Train Loss: 0.9587 | Val Loss: 1.4321
Epoch 46/50 | Train Loss: 0.9860 | Val Loss: 1.3714
Epoch 47/50 | Train Loss: 0.9215 | Val Loss: 2.2381
Epoch 48/50 | Train Loss: 1.0207 | Val Loss: 1.3849
Epoch 49/50 | Train Loss: 1.0562 | Val Loss: 1.3737
Epoch 50/50 | Train Loss: 1.0406 | Val Loss: 1.3745

[Transformer] Test MSE: 0.4800 | RMSE: 0.6929 | MAE: 0.5204
The vanilla Transformer model achieved a test RMSE of 0.6929 and MAE of 0.5204 after 50 epochs. Its performance is on par with the best-tuned LSTM and GRU models, indicating its ability to effectively capture temporal dependencies even without recurrence. This provides a strong baseline for exploring more advanced transformer-based architectures with enhanced temporal encoding and regularization.
V. Saving the Experiment¶
import os
import torch
import json
import pickle
import numpy as np
import pandas as pd
def save_experiment(
model,
config,
train_losses=None,
val_losses=None,
y_true=None,
y_pred=None,
output_dir="experiment_transformer_vanilla",
model_filename="project_weights_transformer_vanilla.pt"
):
os.makedirs(output_dir, exist_ok=True)
# Save model weights
model_path = os.path.join(output_dir, model_filename)
torch.save(model.state_dict(), model_path)
# Save config
config_path = os.path.join(output_dir, "best_config.json")
with open(config_path, "w") as f:
json.dump(config, f, indent=4)
# Save training history
if train_losses is not None and val_losses is not None:
history_path = os.path.join(output_dir, "training_history.pkl")
with open(history_path, "wb") as f:
pickle.dump({"train_losses": train_losses, "val_losses": val_losses}, f)
# Save predictions
if y_true is not None and y_pred is not None:
df_preds = pd.DataFrame({
"Actual": np.array(y_true),
"Predicted": np.array(y_pred)
})
df_preds.to_csv(os.path.join(output_dir, "test_predictions.csv"), index=False)
print(f"Transformer experiment saved to: {output_dir}")
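The config persistence in save_experiment is a plain JSON round trip. A self-contained sketch with a hypothetical config and a temporary directory:

```python
import json
import os
import tempfile

config = {"model_dim": 64, "num_heads": 4, "dropout": 0.1}  # hypothetical values
out_dir = tempfile.mkdtemp()
path = os.path.join(out_dir, "best_config.json")

# Write the config as pretty-printed JSON, then read it back
with open(path, "w") as f:
    json.dump(config, f, indent=4)
with open(path) as f:
    loaded = json.load(f)
```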
Saving...
# Define the config
transformer_config = {
"model_dim": 64,
"num_heads": 4,
"num_layers": 2,
"dropout": 0.1,
"seq_len": 60,
"lr": 0.001
}
# Save experiment
save_experiment(
model=model,
config=transformer_config,
train_losses=train_losses,
val_losses=val_losses,
y_true=y_true,
y_pred=y_pred,
output_dir="experiment_transformer_vanilla",
model_filename="project_weights_transformer_vanilla.pt"
)
VI. Plots¶
import os
import torch
import json
import pickle
import pandas as pd
import matplotlib.pyplot as plt
# Define paths
exp_dir = "experiment_transformer_vanilla"
history_path = os.path.join(exp_dir, "training_history.pkl")
preds_path = os.path.join(exp_dir, "test_predictions.csv")
# Load training history
with open(history_path, "rb") as f:
history = pickle.load(f)
train_losses = history["train_losses"]
val_losses = history["val_losses"]
# Load predictions
df_preds = pd.read_csv(preds_path)
1. TensorBoard Visualization
%reload_ext tensorboard
%tensorboard --logdir=runs
- Plot Train vs. Validation Loss
plt.figure(figsize=(8, 4))
plt.plot(train_losses, label="Train Loss")
plt.plot(val_losses, label="Val Loss")
plt.xlabel("Epoch")
plt.ylabel("MSE Loss")
plt.title("Transformer: Training & Validation Loss")
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
The training loss shows a stable downward trend, while the validation loss fluctuates significantly, indicating potential sensitivity to initialization or data noise. Despite this variance, the model converges to a competitive performance level, suggesting that the Transformer can generalize well with further tuning or regularization.
- Predicted vs Actual Log Returns (Line Plot)
plt.figure(figsize=(10, 4))
plt.plot(df_preds["Actual"], label="Actual", alpha=0.7)
plt.plot(df_preds["Predicted"], label="Predicted", alpha=0.7)
plt.title("Transformer: Predicted vs Actual Log Returns")
plt.xlabel("Time Step")
plt.ylabel("Log Return")
plt.legend()
plt.grid(True)
plt.show()
The predicted values remain close to the zero line, indicating the Transformer struggles to capture the magnitude of volatility in the data. While it approximates the trend center well, it fails to react to large fluctuations, a common limitation when models are trained with MSE loss on noisy financial series. This reinforces the need for uncertainty-aware or regularized architectures.
- Scatter Plot (Actual vs Predicted)
plt.figure(figsize=(6, 6))
plt.scatter(df_preds["Actual"], df_preds["Predicted"], alpha=0.5, color='green')
plt.plot([-2, 2], [-2, 2], color='gray', linestyle='--')
plt.title("Transformer: Actual vs Predicted Scatter")
plt.xlabel("Actual Log Return")
plt.ylabel("Predicted Log Return")
plt.xlim(-2, 2)
plt.ylim(-2, 2)
plt.gca().set_aspect("equal", adjustable="box")  # equal aspect without discarding the fixed limits
plt.grid(True)
plt.show()
The predictions are heavily clustered near zero, highlighting the model’s tendency to underreact to large deviations. This underdispersion suggests that while the Transformer captures the central trend well, it struggles to model high-volatility movements accurately, underscoring the importance of incorporating uncertainty estimation or variance-aware loss functions in future enhancements.
I. Define the Transformer¶
import torch
import torch.nn as nn
import math
# Positional encoding to inject temporal order into input embeddings
class PositionalEncoding(nn.Module):
def __init__(self, d_model, max_len=1000):
super().__init__()
pe = torch.zeros(max_len, d_model)
position = torch.arange(0, max_len).unsqueeze(1)
div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)
self.pe = pe.unsqueeze(0) # (1, max_len, d_model)
def forward(self, x):
return x + self.pe[:, :x.size(1)].to(x.device)
# Transformer model with dropout regularization and MC Dropout for uncertainty estimation
class TransformerRegularized(nn.Module):
def __init__(
self,
input_dim,
model_dim=64,
num_heads=4,
num_layers=3,
dropout=0.2,
ff_dim=128,
mc_dropout=False  # if True, keep dropout active during inference
):
super().__init__()
self.mc_dropout = mc_dropout # enables dropout at inference
# Input projection + normalization + regularization
self.input_proj = nn.Sequential(
nn.Linear(input_dim, model_dim),
nn.LayerNorm(model_dim),
nn.Dropout(dropout)
)
self.pos_encoder = PositionalEncoding(model_dim)
# Transformer encoder with pre-layer normalization for better convergence
encoder_layer = nn.TransformerEncoderLayer(
d_model=model_dim,
nhead=num_heads,
dim_feedforward=ff_dim,
dropout=dropout,
batch_first=True,
norm_first=True
)
self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
# Output regression head with additional LayerNorm and Dropout
self.regressor = nn.Sequential(
nn.LayerNorm(model_dim),
nn.Linear(model_dim, 32),
nn.ReLU(),
nn.Dropout(dropout),
nn.Linear(32, 1)
)
def forward(self, x):
x = self.input_proj(x)
x = self.pos_encoder(x)
x = self.transformer(x)
x = x[:, -1, :] # use the final time step representation
# Enable dropout during inference for MC Dropout sampling
if self.mc_dropout:
for m in self.regressor:
if isinstance(m, nn.Dropout):
m.train() # keep dropout active during inference
return self.regressor(x)
Monte Carlo Dropout Prediction Function
import torch
import numpy as np
def predict_mc_dropout(model, data_loader, device, n_samples=100):
model.eval()
model.mc_dropout = True
model.to(device)
all_preds = []
with torch.no_grad():
for _ in range(n_samples):
preds = []
for xb, _ in data_loader:
xb = xb.to(device)
pred = model(xb)
preds.append(pred.cpu())
preds = torch.cat(preds, dim=0).squeeze().numpy()
all_preds.append(preds)
return np.array(all_preds) # shape: [n_samples, n_points]
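The idea behind these stochastic passes can be illustrated without PyTorch: repeatedly apply a random (inverted) dropout mask to a fixed linear model and look at the spread of the outputs. Everything below is a hypothetical toy, not the project model:

```python
import numpy as np

rng = np.random.default_rng(42)
w = np.array([1.0, -2.0, 0.5])   # toy weights
x = np.array([0.2, 0.1, 0.4])    # toy input
p = 0.1                          # dropout probability

samples = []
for _ in range(200):             # 200 stochastic "forward passes"
    keep = rng.random(3) >= p                       # Bernoulli keep-mask
    samples.append((w * x * keep).sum() / (1 - p))  # inverted-dropout scaling
samples = np.array(samples)

mean_pred = samples.mean()   # close to the deterministic output w @ x = 0.2
uncertainty = samples.std()  # spread induced purely by the dropout masks
```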
Compute VaR & Expected Shortfall
import numpy as np
def compute_var_es_mc(predictions, alpha=0.05):
mean = predictions.mean(axis=0)
std = predictions.std(axis=0)
# Compute VaR at each time step (percentile across samples)
var = np.percentile(predictions, 100 * alpha, axis=0)
# Compute ES per time step
es = []
for t in range(predictions.shape[1]):
below_var = predictions[:, t][predictions[:, t] < var[t]]
es_t = below_var.mean() if len(below_var) > 0 else var[t] # fallback to VaR if no values below
es.append(es_t)
es = np.array(es)
return mean, std, var, es
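As a sanity check of the percentile logic, running the same computation on synthetic N(0, 1) "samples" should land near the textbook values VaR_95 ≈ -1.645 and ES_95 ≈ -2.06 (the draws below are a toy stand-in for MC predictions, not model output):

```python
import numpy as np

rng = np.random.default_rng(0)
# Pretend MC output: 10,000 samples for a single time step
predictions = rng.normal(0.0, 1.0, size=(10_000, 1))

alpha = 0.05
var = np.percentile(predictions, 100 * alpha, axis=0)  # per-step 5th percentile
tail = predictions[predictions[:, 0] < var[0], 0]      # samples beyond VaR
es = tail.mean()                                       # Expected Shortfall
# For N(0, 1): VaR is approximately -1.645, ES approximately -2.06
```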
II. Training & Evaluation Setup¶
import torch
import time
from torch.utils.tensorboard import SummaryWriter
def train_model_with_early_stopping(
model,
train_loader,
val_loader,
criterion,
optimizer,
epochs=100,
device=device,
patience=10,
log_to_tensorboard=True,
config_name="transformer_regularized"
):
model.to(device)
best_val_loss = float('inf')
best_model_state = None
counter = 0
train_losses, val_losses = [], []
if log_to_tensorboard:
writer = SummaryWriter(log_dir=f"runs/{config_name}")
for epoch in range(1, epochs + 1):
model.train()
epoch_train_loss = 0
for xb, yb in train_loader:
xb, yb = xb.to(device), yb.to(device)
optimizer.zero_grad()
preds = model(xb)
loss = criterion(preds, yb)
loss.backward()
optimizer.step()
epoch_train_loss += loss.item() * xb.size(0)
epoch_train_loss /= len(train_loader.dataset)
train_losses.append(epoch_train_loss)
# Validation
model.eval()
epoch_val_loss = 0
with torch.no_grad():
for xb, yb in val_loader:
xb, yb = xb.to(device), yb.to(device)
preds = model(xb)
loss = criterion(preds, yb)
epoch_val_loss += loss.item() * xb.size(0)
epoch_val_loss /= len(val_loader.dataset)
val_losses.append(epoch_val_loss)
# Logging
if log_to_tensorboard:
writer.add_scalars(f"{config_name}/loss", {
"Train": epoch_train_loss,
"Val": epoch_val_loss
}, epoch)
print(f"Epoch {epoch:02d}/{epochs} | Train Loss: {epoch_train_loss:.4f} | Val Loss: {epoch_val_loss:.4f}")
# Early stopping check
if epoch_val_loss < best_val_loss:
best_val_loss = epoch_val_loss
best_model_state = model.state_dict()
counter = 0
else:
counter += 1
if counter >= patience:
print(f"Early stopping at epoch {epoch}")
break
if log_to_tensorboard:
writer.close()
# Restore best model
if best_model_state is not None:
model.load_state_dict(best_model_state)
return model, train_losses, val_losses
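The patience counter above can be traced on a hypothetical validation curve; note that a tie with the best loss counts as "no improvement" because the comparison is strict:

```python
# Made-up validation losses: two improvements, then a plateau
val_curve = [1.45, 1.43, 1.44, 1.46, 1.43, 1.47, 1.48]
patience = 3

best, counter, stop_epoch = float("inf"), 0, None
for epoch, v in enumerate(val_curve, start=1):
    if v < best:          # strict improvement resets the counter
        best, counter = v, 0
    else:
        counter += 1
        if counter >= patience:
            stop_epoch = epoch
            break
# Stops at epoch 5: the repeated 1.43 is not a strict improvement
```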
IV. Run the Transformer Training¶
SEQ_LEN = 60
X_train, y_train, X_val, y_val, X_test, y_test = prepare_data(
df_scaled, feature_cols, seq_len=SEQ_LEN
)
train_loader, val_loader = get_data_loaders(X_train, y_train, X_val, y_val, batch_size=32)
test_loader, _ = get_data_loaders(X_test, y_test, X_test, y_test, batch_size=32)
input_dim = X_train.shape[2]
model = TransformerRegularized(
input_dim=input_dim,
model_dim=128,
num_heads=4,
num_layers=4,
dropout=0.1,
ff_dim=256,
mc_dropout=True
)
import torch.nn as nn
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.0005)
model, train_losses, val_losses = train_model_with_early_stopping(
model=model,
train_loader=train_loader,
val_loader=val_loader,
criterion=criterion,
optimizer=optimizer,
epochs=100,
patience=10,
log_to_tensorboard=True,
config_name="TransformerRegularized"
)
mse, rmse, mae, y_pred, y_true = evaluate_model(
model, test_loader, criterion=nn.MSELoss(), return_predictions=True
)
print(f"\nFinal Test Results — TransformerRegularized:")
print(f"MSE : {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"MAE : {mae:.4f}")
/usr/local/lib/python3.11/dist-packages/torch/nn/modules/transformer.py:385: UserWarning: enable_nested_tensor is True, but self.use_nested_tensor is False because encoder_layer.norm_first was True warnings.warn(
Epoch 01/100 | Train Loss: 1.0498 | Val Loss: 1.4501
Epoch 02/100 | Train Loss: 1.0412 | Val Loss: 1.4330
Epoch 03/100 | Train Loss: 1.0323 | Val Loss: 1.5041
Epoch 04/100 | Train Loss: 1.0415 | Val Loss: 1.4737
Epoch 05/100 | Train Loss: 1.0337 | Val Loss: 1.4898
Epoch 06/100 | Train Loss: 1.0331 | Val Loss: 2.2935
Epoch 07/100 | Train Loss: 1.0383 | Val Loss: 1.4732
Epoch 08/100 | Train Loss: 1.0224 | Val Loss: 1.4272
Epoch 09/100 | Train Loss: 1.0367 | Val Loss: 1.4290
Epoch 10/100 | Train Loss: 1.0325 | Val Loss: 1.4468
Epoch 11/100 | Train Loss: 1.0189 | Val Loss: 1.4796
Epoch 12/100 | Train Loss: 1.0111 | Val Loss: 1.5031
Epoch 13/100 | Train Loss: 1.0042 | Val Loss: 1.4885
Epoch 14/100 | Train Loss: 1.0178 | Val Loss: 1.5225
Epoch 15/100 | Train Loss: 1.0110 | Val Loss: 1.4446
Epoch 16/100 | Train Loss: 0.9777 | Val Loss: 1.6107
Epoch 17/100 | Train Loss: 0.9817 | Val Loss: 1.7964
Epoch 18/100 | Train Loss: 1.0004 | Val Loss: 2.4265
Early stopping at epoch 18

Final Test Results — TransformerRegularized:
MSE : 0.4953
RMSE: 0.7038
MAE : 0.5318
The regularized Transformer with Monte Carlo Dropout achieved a test RMSE of 0.7038 and MAE of 0.5318. While slightly less accurate than the best-tuned GRU/LSTM and vanilla Transformer, it offers the added advantage of predictive uncertainty through stochastic forward passes. This trade-off between slight performance cost and richer model interpretability is valuable in financial forecasting, where confidence intervals and risk quantification are critical. The model serves as a strong foundation for further risk-aware extensions.
Compute VaR and ES
mc_preds = predict_mc_dropout(model, test_loader, device, n_samples=100)
mean_pred, std_pred, var_95, es_95 = compute_var_es_mc(mc_preds, alpha=0.05)
Plotting
plt.figure(figsize=(12, 5))
plt.plot(y_true, label="Actual", alpha=0.8)
plt.plot(mean_pred, label="Mean Prediction", color='orange')
plt.plot(var_95, label="VaR (95%)", color='red', linestyle='--')
plt.plot(es_95, label="ES (95%)", color='purple', linestyle='dashed')
plt.fill_between(range(len(mean_pred)), mean_pred - 2*std_pred, mean_pred + 2*std_pred, alpha=0.2, label="±2 std")
plt.title("MC Dropout: Mean Prediction, VaR, and Expected Shortfall")
plt.legend()
plt.grid(True)
plt.show()
This plot demonstrates the model’s ability to not only provide point forecasts but also meaningful confidence intervals. VaR and Expected Shortfall dynamically adjust based on model uncertainty, particularly in high-volatility regions. While the predictive mean is conservative and smoother than actual returns, the model offers valuable insight into potential downside risk, essential for risk-aware decision-making.
V. Save the Experiment¶
save_experiment(
model=model,
config={"seq_len": SEQ_LEN, "model_dim": 128, "num_heads": 4, "num_layers": 4,
"dropout": 0.1, "ff_dim": 256, "lr": 0.0005},
train_losses=train_losses,
val_losses=val_losses,
y_true=y_true,
y_pred=y_pred,
output_dir="experiment_transformer_regularized",
model_filename="project_weights_transformer_regularized.pt"
)
Final experiment saved to: experiment_transformer_regularized
VI. Plots¶
1. TensorBoard Visualization
%reload_ext tensorboard
%tensorboard --logdir=runs
2. Plot: Training vs Validation Loss
import pickle
import matplotlib.pyplot as plt
# Load training history
with open("experiment_transformer_regularized/training_history.pkl", "rb") as f:
history = pickle.load(f)
train_losses = history["train_losses"]
val_losses = history["val_losses"]
# Plot
plt.figure(figsize=(8, 4))
plt.plot(train_losses, label="Train Loss")
plt.plot(val_losses, label="Val Loss")
plt.xlabel("Epoch")
plt.ylabel("MSE Loss")
plt.title("Transformer (Regularized): Training & Validation Loss")
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
The training loss decreases steadily, while the validation loss exhibits fluctuations, a typical pattern when dropout-based regularization is active. Early stopping was triggered at epoch 18 once the validation loss began increasing persistently, preventing further overfitting. This strategy preserved generalization and resulted in a final test RMSE of 0.7038, with added benefits of uncertainty-aware forecasting and risk quantification via VaR and Expected Shortfall.
3. Plot: Predicted vs Actual (Line Plot)
import pandas as pd
# Load predictions
df_preds = pd.read_csv("experiment_transformer_regularized/test_predictions.csv")
# Plot
plt.figure(figsize=(10, 4))
plt.plot(df_preds["Actual"], label="Actual", alpha=0.7)
plt.plot(df_preds["Predicted"], label="Predicted", alpha=0.7)
plt.title("Transformer (Regularized): Predicted vs Actual Log Returns")
plt.xlabel("Time Step")
plt.ylabel("Log Return")
plt.legend()
plt.grid(True)
plt.show()
The predicted series closely follows the overall trend of the actual returns, especially in low-volatility periods. While extreme fluctuations are still underpredicted, the model demonstrates improved responsiveness compared to previous baselines. This smoothing is expected in probabilistic forecasts, where the mean prediction serves as a central tendency, and uncertainty is captured separately via predictive intervals. The result supports the model’s use in risk-aware forecasting rather than exact value prediction.
4. Plot: Scatter — Actual vs Predicted
plt.figure(figsize=(6, 6))
plt.scatter(df_preds["Actual"], df_preds["Predicted"], alpha=0.5, color='purple')
plt.plot([-2, 2], [-2, 2], color='gray', linestyle='--') # identity line
plt.title("Transformer (Regularized): Actual vs Predicted Scatter")
plt.xlabel("Actual Log Return")
plt.ylabel("Predicted Log Return")
plt.grid(True)
plt.axis("equal")
plt.tight_layout()
plt.show()
The predictions show a clear concentration around zero, consistent with a tendency to regress toward the mean. While there's reasonable alignment with the diagonal in moderate return ranges, the model underpredicts more extreme values, especially in the tails. This conservative pattern is expected in models optimized for risk-aware forecasting, where the goal is to capture distributional characteristics rather than individual spikes. The spread is tighter than in prior baselines, suggesting improved calibration.
I. Model Architecture¶
Final Architecture: Patch-based Transformer with Monte Carlo Dropout¶
This model introduces a patch-based input strategy inspired by Vision Transformers (ViT) and recent time series architectures like PatchTST. It embeds non-overlapping time patches, applies positional encoding, and processes them with stacked Transformer encoder blocks. Monte Carlo Dropout is enabled during inference to support predictive uncertainty estimation and risk-aware forecasting.
import torch
import torch.nn as nn
import math
# Patch-based input embedding inspired by Vision Transformers (ViT)
class PatchEmbedding(nn.Module):
def __init__(self, input_dim, patch_len, model_dim):
super().__init__()
self.patch_len = patch_len
self.proj = nn.Linear(input_dim * patch_len, model_dim)
def forward(self, x):
# x: [B, T, D] => reshape into non-overlapping patches
B, T, D = x.shape
assert T % self.patch_len == 0, "Time series length must be divisible by patch_len"
num_patches = T // self.patch_len
x = x.view(B, num_patches, D * self.patch_len)
return self.proj(x)
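The view in PatchEmbedding.forward concatenates the features of patch_len consecutive time steps into one token before projection. The equivalent NumPy reshape, on toy shapes:

```python
import numpy as np

B, T, D, patch_len = 2, 60, 6, 10
x = np.arange(B * T * D, dtype=float).reshape(B, T, D)  # [batch, time, features]

num_patches = T // patch_len
patches = x.reshape(B, num_patches, patch_len * D)      # [2, 6, 60]
# Each token is patch_len consecutive steps flattened in time-major order
```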
# Same as before, add position information to each token
class PositionalEncoding(nn.Module):
def __init__(self, d_model, max_len=500):
super().__init__()
pe = torch.zeros(max_len, d_model)
position = torch.arange(0, max_len).unsqueeze(1)
div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)
self.pe = pe.unsqueeze(0)
def forward(self, x):
return x + self.pe[:, :x.size(1)].to(x.device)
# Patch-based Transformer with MC Dropout and global average pooling
class ForecastingTransformer(nn.Module):
def __init__(
self,
input_dim,
model_dim=128,
patch_len=10,
num_heads=4,
num_layers=3,
ff_dim=256,
dropout=0.1,
output_dim=1,
mc_dropout=True
):
super().__init__()
self.mc_dropout = mc_dropout
# Patchify time series and project into model_dim space
self.embedding = PatchEmbedding(input_dim, patch_len, model_dim)
# Add positional encoding to patches
self.pos_encoding = PositionalEncoding(model_dim)
# Transformer encoder with norm-first setting
encoder_layer = nn.TransformerEncoderLayer(
d_model=model_dim,
nhead=num_heads,
dim_feedforward=ff_dim,
dropout=dropout,
batch_first=True,
norm_first=True
)
self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
# Output head with LayerNorm, Dropout, and MLP
self.output_head = nn.Sequential(
nn.LayerNorm(model_dim),
nn.Dropout(dropout),
nn.Linear(model_dim, 64),
nn.ReLU(),
nn.Dropout(dropout),
nn.Linear(64, output_dim)
)
def forward(self, x):
# x: [B, T, D] => [B, Num_Patches, Model_Dim]
x = self.embedding(x)
x = self.pos_encoding(x)
x = self.transformer(x)
# Global average pooling over patches instead of last token
x = x.mean(dim=1)
# Enable MC Dropout during inference
if self.mc_dropout:
for m in self.output_head:
if isinstance(m, nn.Dropout):
m.train()
return self.output_head(x)
mc_dropout_predict() Utility Function¶
def mc_dropout_predict(model, loader, n_samples=50, device=device):  # default to the global device, like the other helpers
model.eval()
model.to(device)
preds_mc = []
with torch.no_grad():
for _ in range(n_samples):
batch_preds = []
for xb, _ in loader:
xb = xb.to(device)
pred = model(xb) # Dropout is active due to mc_dropout flag
batch_preds.append(pred.cpu())
preds_mc.append(torch.cat(batch_preds).squeeze(1))
return torch.stack(preds_mc) # Shape: [n_samples, N]
Utility Function for VaR & ES¶
import numpy as np
# Computes Value at Risk (VaR) and Expected Shortfall (ES) per time step using MC samples (n_samples x N).
def compute_var_es(mc_samples, alpha=0.05):
    preds = mc_samples.numpy()  # shape: [n_samples, N]
    var = np.quantile(preds, alpha, axis=0)
    # ES per time step: mean of the samples at or below that step's VaR.
    # Masking with np.where keeps the [n_samples, N] shape; a plain 2-D boolean
    # index would flatten the array and collapse every step into one scalar.
    tail = np.where(preds <= var, preds, np.nan)
    es = np.nanmean(tail, axis=0)
    return var, es
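On a tiny hypothetical MC matrix (4 samples, 2 time steps), the per-step quantile-and-tail-mean logic works out as follows:

```python
import numpy as np

preds = np.array([[ 0.0, 1.0],
                  [-1.0, 0.0],
                  [-2.0, 2.0],
                  [ 1.0, 3.0]])               # rows: MC samples, columns: time steps

var = np.quantile(preds, 0.25, axis=0)        # per-step 25% quantile -> [-1.25, 0.75]
tail = np.where(preds <= var, preds, np.nan)  # keep only tail values, column by column
es = np.nanmean(tail, axis=0)                 # per-step tail mean -> [-2.0, 0.0]
```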
II. Training Setup¶
from torch.utils.tensorboard import SummaryWriter
import time
def train_model(
model,
train_loader,
val_loader,
criterion,
optimizer,
device,
epochs=50,
patience=8,
log_dir=None
):
model.to(device)
best_val_loss = float("inf")
best_model_state = None
counter = 0
train_losses, val_losses = [], []
writer = SummaryWriter(log_dir=log_dir) if log_dir else None
for epoch in range(epochs):
model.train()
train_loss = 0.0
for xb, yb in train_loader:
xb, yb = xb.to(device), yb.to(device)
optimizer.zero_grad()
preds = model(xb)
loss = criterion(preds, yb)
loss.backward()
optimizer.step()
train_loss += loss.item()
train_loss /= len(train_loader)
train_losses.append(train_loss)
model.eval()
val_loss = 0.0
with torch.no_grad():
for xb, yb in val_loader:
xb, yb = xb.to(device), yb.to(device)
preds = model(xb)
loss = criterion(preds, yb)
val_loss += loss.item()
val_loss /= len(val_loader)
val_losses.append(val_loss)
if writer:
writer.add_scalar("Loss/Train", train_loss, epoch)
writer.add_scalar("Loss/Val", val_loss, epoch)
print(f"Epoch {epoch+1}/{epochs} | Train Loss: {train_loss:.4f} | Val Loss: {val_loss:.4f}")
if val_loss < best_val_loss:
best_val_loss = val_loss
best_model_state = model.state_dict()
counter = 0
else:
counter += 1
if counter >= patience:
print(f"Early stopping triggered at epoch {epoch+1}")
break
if writer:
writer.close()
model.load_state_dict(best_model_state)
return model, train_losses, val_losses
III. Evaluation Function¶
import numpy as np
import torch
def evaluate_model(model, data_loader, criterion=nn.MSELoss(), device="cpu", return_predictions=False):
model.eval()
model.to(device)
preds, targets = [], []
with torch.no_grad():
for xb, yb in data_loader:
xb, yb = xb.to(device), yb.to(device)
pred = model(xb)
preds.append(pred.cpu())
targets.append(yb.cpu())
preds = torch.cat(preds).squeeze()
targets = torch.cat(targets).squeeze()
mse = torch.mean((preds - targets) ** 2).item()
rmse = np.sqrt(mse)
mae = torch.mean(torch.abs(preds - targets)).item()
if return_predictions:
return mse, rmse, mae, preds.numpy(), targets.numpy()
return mse, rmse, mae
IV. Training the Final Architecture¶
# Define config
final_config = {
"model_dim": 128,
"patch_len": 10,
"num_heads": 8,
"num_layers": 3,
"ff_dim": 256,
"dropout": 0.2,
"lr": 0.0005,
"batch_size": 32,
"epochs": 100,
"patience": 10
}
# Prepare data
X_train, y_train, X_val, y_val, X_test, y_test = prepare_data(
df_scaled, feature_cols, seq_len=final_config["patch_len"] * 4
)
train_loader, val_loader = get_data_loaders(X_train, y_train, X_val, y_val, final_config["batch_size"])
test_loader, _ = get_data_loaders(X_test, y_test, X_test, y_test, final_config["batch_size"])
# Initialize model
input_dim = X_train.shape[2]
model = ForecastingTransformer(
input_dim=input_dim,
model_dim=final_config["model_dim"],
patch_len=final_config["patch_len"],
num_heads=final_config["num_heads"],
num_layers=final_config["num_layers"],
ff_dim=final_config["ff_dim"],
dropout=final_config["dropout"]
)
optimizer = torch.optim.Adam(model.parameters(), lr=final_config["lr"])
criterion = nn.MSELoss()
# Train with early stopping
model, train_losses, val_losses = train_model(
model,
train_loader,
val_loader,
criterion,
optimizer,
device=device,
epochs=final_config["epochs"],
patience=final_config["patience"],
log_dir="runs/final_transformer"
)
# Final evaluation
mse, rmse, mae, y_pred, y_true = evaluate_model(model, test_loader, return_predictions=True)
print(f"\nFinal Test Results — Transformer Final Architecture:")
print(f"MSE : {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"MAE : {mae:.4f}")
/usr/local/lib/python3.11/dist-packages/torch/nn/modules/transformer.py:385: UserWarning: enable_nested_tensor is True, but self.use_nested_tensor is False because encoder_layer.norm_first was True warnings.warn(
Epoch 1/100 | Train Loss: 1.0441 | Val Loss: 1.4191
Epoch 2/100 | Train Loss: 1.0381 | Val Loss: 1.4102
Epoch 3/100 | Train Loss: 1.0216 | Val Loss: 1.4131
Epoch 4/100 | Train Loss: 1.0185 | Val Loss: 1.4353
Epoch 5/100 | Train Loss: 1.0138 | Val Loss: 1.4025
Epoch 6/100 | Train Loss: 0.9872 | Val Loss: 1.4650
Epoch 7/100 | Train Loss: 1.0218 | Val Loss: 1.6516
Epoch 8/100 | Train Loss: 1.0210 | Val Loss: 1.4030
Epoch 9/100 | Train Loss: 0.9826 | Val Loss: 4.2792
Epoch 10/100 | Train Loss: 1.0239 | Val Loss: 1.5222
Epoch 11/100 | Train Loss: 1.1201 | Val Loss: 1.3962
Epoch 12/100 | Train Loss: 1.0525 | Val Loss: 1.3979
Epoch 13/100 | Train Loss: 1.0288 | Val Loss: 1.3999
Epoch 14/100 | Train Loss: 1.0297 | Val Loss: 1.4162
Epoch 15/100 | Train Loss: 1.0308 | Val Loss: 1.4040
Epoch 16/100 | Train Loss: 1.0288 | Val Loss: 1.4056
Epoch 17/100 | Train Loss: 1.0203 | Val Loss: 1.4045
Epoch 18/100 | Train Loss: 1.0162 | Val Loss: 1.4049
Epoch 19/100 | Train Loss: 1.0105 | Val Loss: 1.4502
Epoch 20/100 | Train Loss: 0.9962 | Val Loss: 1.4845
Epoch 21/100 | Train Loss: 0.9911 | Val Loss: 1.5015
Early stopping triggered at epoch 21

Final Test Results — Transformer Final Architecture:
MSE : 0.6567
RMSE: 0.8104
MAE : 0.6271
The training and validation losses remained relatively stable across epochs, with early stopping triggered at epoch 21 to prevent overfitting. Despite some minor fluctuations and a spike at epoch 9 (likely due to stochastic dropout sampling or a volatile batch), the model quickly recovered, reflecting the robustness of the patch-based Transformer architecture. While not achieving the lowest error among all models, it produced a well-regularized, generalizable model with an RMSE of 0.8104 and an MAE of 0.6271, a strong result given the added benefits of modularity, uncertainty estimation, and deployment readiness.
Run MC Sampling
n_samples = 50 # Number of stochastic passes
mc_samples = mc_dropout_predict(model, test_loader, n_samples=n_samples)
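The `mc_dropout_predict` helper is defined earlier in the notebook. For reference, a minimal sketch of what such a function might do is shown below; the detail of re-enabling only the `nn.Dropout` modules (rather than putting the whole model in train mode) is an assumption about the implementation:

```python
import torch
import torch.nn as nn

def mc_dropout_predict_sketch(model, loader, n_samples=50):
    """Run n_samples stochastic forward passes with dropout kept active.

    Returns a tensor of shape (n_samples, n_test_points)."""
    model.eval()
    # Re-enable dropout layers only; other layers stay in eval mode
    for m in model.modules():
        if isinstance(m, nn.Dropout):
            m.train()
    samples = []
    with torch.no_grad():
        for _ in range(n_samples):
            preds = [model(xb).squeeze(-1) for xb, _ in loader]
            samples.append(torch.cat(preds))
    return torch.stack(samples)  # (n_samples, N)
```

Keeping dropout stochastic at inference time is what turns repeated forward passes into samples from an approximate predictive distribution (Gal & Ghahramani, 2016).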
Compute VaR and ES
# Compute risk metrics at 95% confidence
alpha = 0.05
var_95, es_95 = compute_var_es(mc_samples, alpha=alpha)
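`compute_var_es` is likewise defined earlier in the notebook. A plausible sketch, treating VaR at each time step as the α-quantile of the MC predictive samples and ES as the average of the tail samples at or below that quantile (the shapes and return types here are assumptions):

```python
import torch

def compute_var_es_sketch(mc_samples, alpha=0.05):
    """Per-time-step VaR and ES from MC dropout samples.

    mc_samples: (n_samples, N) tensor of predicted log returns.
    VaR: the alpha-quantile of the predictive distribution at each step.
    ES:  the mean of samples at or below that quantile (tail average)."""
    var = torch.quantile(mc_samples, alpha, dim=0)   # (N,)
    tail = mc_samples <= var.unsqueeze(0)            # mask of tail samples
    es = (mc_samples * tail).sum(dim=0) / tail.sum(dim=0)
    return var, es
```

By construction ES is at least as extreme as VaR, which matches its role as the expected loss conditional on exceeding the VaR threshold.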
Compute Pointwise Mean, Std, and True Values
# Compute predictive mean and std
mean_preds = mc_samples.mean(dim=0)
std_preds = mc_samples.std(dim=0)
# Get ground truth
y_true_tensor = torch.cat([yb for _, yb in test_loader]).squeeze()
Save VaR and ES along with predictions:
import pandas as pd
df_risk = pd.DataFrame({
"Actual": y_true_tensor.numpy(),
"Prediction": mean_preds.numpy(),
"StdDev": std_preds.numpy(),
f"VaR_{int((1-alpha)*100)}": var_95,
f"ES_{int((1-alpha)*100)}": es_95
})
df_risk.to_csv("experiment_transformer_final/risk_metrics.csv", index=False)
Plot Uncertainty
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 5))
plt.plot(y_true_tensor, label="Actual")
plt.plot(mean_preds, label="Mean Prediction")
plt.fill_between(
range(len(mean_preds)),
mean_preds - 2 * std_preds,
mean_preds + 2 * std_preds,
color='green', alpha=0.3,
label="±2 std (uncertainty)"
)
plt.title("Transformer Final: MC Dropout Prediction with Uncertainty")
plt.xlabel("Time Step")
plt.ylabel("Log Return")
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
The orange line shows the mean predicted log return, while the green band marks ±2 standard deviations (an approximate 95% confidence interval) across the stochastic forward passes. The model captures higher uncertainty during volatile periods (e.g., near time steps 50, 200, 300) and expresses greater confidence during quieter intervals. This ability to quantify predictive uncertainty is critical in financial applications, where understanding the confidence of a forecast can be as important as the forecast itself.
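A quick sanity check for such a band, not part of the notebook but easy to bolt on, is its empirical coverage: the fraction of actual values that fall inside mean ± 2·std, which should land near 95% if the uncertainty estimates are well calibrated:

```python
import torch

def empirical_coverage(y_true, mean, std, k=2.0):
    """Fraction of actual values inside the mean ± k*std band."""
    inside = (y_true >= mean - k * std) & (y_true <= mean + k * std)
    return inside.float().mean().item()

# e.g. empirical_coverage(y_true_tensor, mean_preds, std_preds)
```

Coverage far above 95% suggests the band is too wide (over-conservative); far below suggests the MC dropout variance underestimates the true predictive uncertainty.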
Plot VaR and ES
plt.figure(figsize=(10, 5))
plt.plot(y_true_tensor, label="Actual", alpha=0.6)
plt.plot(mean_preds, label="Mean Prediction", alpha=0.8)
plt.plot(var_95, label=f"VaR ({int((1-alpha)*100)}%)", linestyle='--', color='red')
plt.axhline(y=es_95, color='purple', linestyle='--', label='ES (95%)')
plt.fill_between(
range(len(mean_preds)),
mean_preds - 2 * std_preds,
mean_preds + 2 * std_preds,
color='green', alpha=0.2,
label="±2 std"
)
plt.title("MC Dropout: Mean Prediction, VaR, and Expected Shortfall")
plt.xlabel("Time Step")
plt.ylabel("Log Return")
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
V. Save the Experiment¶
import os, json, pickle
import pandas as pd
import numpy as np
import torch
def save_experiment(
model, config, train_losses, val_losses, y_true, y_pred,
output_dir="experiment_transformer_final", model_filename="project_weights_transformer_final.pt"
):
os.makedirs(output_dir, exist_ok=True)
# Save model weights
torch.save(model.state_dict(), os.path.join(output_dir, model_filename))
# Save config
with open(os.path.join(output_dir, "best_config.json"), "w") as f:
json.dump(config, f, indent=4)
# Save training history
with open(os.path.join(output_dir, "training_history.pkl"), "wb") as f:
pickle.dump({"train_losses": train_losses, "val_losses": val_losses}, f)
# Save predictions
df_preds = pd.DataFrame({
"Actual": np.array(y_true),
"Predicted": np.array(y_pred)
})
df_preds.to_csv(os.path.join(output_dir, "test_predictions.csv"), index=False)
print(f"Final experiment saved to: {output_dir}")
Saving...
save_experiment(
model=model,
config=final_config,
train_losses=train_losses,
val_losses=val_losses,
y_true=y_true,
y_pred=y_pred,
output_dir="experiment_transformer_final",
model_filename="project_weights_transformer_final.pt"
)
Final experiment saved to: experiment_transformer_final
VI. Plot¶
1. TensorBoard Visualization
%reload_ext tensorboard
%tensorboard --logdir=runs
import pickle
import pandas as pd
import matplotlib.pyplot as plt
# Load training history
with open("experiment_transformer_final/training_history.pkl", "rb") as f:
history = pickle.load(f)
train_losses = history["train_losses"]
val_losses = history["val_losses"]
# Load predictions
df_preds = pd.read_csv("experiment_transformer_final/test_predictions.csv")
2. Training vs Validation Loss Plot
plt.figure(figsize=(8, 4))
plt.plot(train_losses, label="Train Loss")
plt.plot(val_losses, label="Val Loss")
plt.xlabel("Epoch")
plt.ylabel("MSE Loss")
plt.title("Transformer (Final): Training & Validation Loss")
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
The training loss decreases steadily, while the validation loss remains mostly stable aside from the sharp spike at epoch 9. This outlier is likely due to a noisy batch or dropout-induced sampling variability, which is common with stochastic training. The model quickly recovers, and early stopping at epoch 21 helps avoid overfitting. Overall, the training behavior is stable and well-regularized, reflecting the robustness of the patch-based design.
3. Line Plot: Predicted vs Actual (Log Returns)
plt.figure(figsize=(10, 4))
plt.plot(df_preds["Actual"], label="Actual", alpha=0.7)
plt.plot(df_preds["Predicted"], label="Predicted", alpha=0.7)
plt.title("Transformer (Final): Predicted vs Actual Log Returns")
plt.xlabel("Time Step")
plt.ylabel("Log Return")
plt.legend()
plt.grid(True)
plt.show()
The predicted values align closely with the general direction of the actual returns, especially during stable periods. While the model smooths out some extreme movements, a common behavior in MSE-optimized regressors, it effectively captures broader trends and turning points. This balance between fidelity and stability makes it well-suited for integration with downstream risk metrics like VaR and Expected Shortfall.
4. Scatter Plot: Actual vs Predicted
plt.figure(figsize=(6, 6))
plt.scatter(df_preds["Actual"], df_preds["Predicted"], alpha=0.5, color='darkorange')
plt.plot([-2, 2], [-2, 2], color='gray', linestyle='--') # identity line
plt.title("Transformer (Final): Actual vs Predicted Scatter")
plt.xlabel("Actual Log Return")
plt.ylabel("Predicted Log Return")
plt.grid(True)
plt.axis("equal")
plt.tight_layout()
plt.show()
The predictions generally cluster around the origin, showing that the model captures the central tendency well. There’s reasonable alignment with the diagonal line, especially for moderate returns. As with other models, the extreme values are slightly underpredicted, a common effect in models trained with MSE loss. Overall, this plot confirms that the model is well-calibrated for typical return ranges and reasonably responsive to directional movement.
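Since the scatter plot mainly speaks to magnitude calibration, a complementary check (an illustrative addition, not from the notebook) is directional accuracy: the fraction of time steps on which the predicted and actual log returns share a sign:

```python
import numpy as np

def directional_accuracy(actual, predicted):
    """Fraction of steps where prediction and actual agree in sign."""
    a = np.sign(np.asarray(actual, dtype=float))
    p = np.sign(np.asarray(predicted, dtype=float))
    return float((a == p).mean())

# e.g. directional_accuracy(df_preds["Actual"], df_preds["Predicted"])
```

A value meaningfully above 0.5 indicates the model carries directional information beyond what the smoothed magnitudes suggest.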
VII. Comparisons¶
1. Tuned LSTM vs GRU
1.1 Actual vs Predictions
import pandas as pd
import matplotlib.pyplot as plt
# Load predictions from saved CSVs
lstm_df = pd.read_csv("/content/experiment_lstm_tuned/test_predictions.csv")
gru_df = pd.read_csv("/content/experiment_gru_tuned/test_predictions.csv")
# Extract actual and predicted values
y_true = lstm_df["Actual"]  # Both files share the same ground truth
y_lstm = lstm_df["Predicted"]
y_gru = gru_df["Predicted"]
# Plot
plt.figure(figsize=(12, 5))
plt.plot(y_true, label="Actual", linewidth=1)
plt.plot(y_lstm, label="LSTM Predicted", color="orange")
plt.plot(y_gru, label="GRU Predicted", color="green")
plt.xlabel("Time Step")
plt.ylabel("Log Return")
plt.title("Actual vs LSTM vs GRU Log Returns on Test Set")
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
1.2 Training and Validation Losses
import pickle
import matplotlib.pyplot as plt
# Load LSTM training history
with open("/content/experiment_lstm_tuned/training_history.pkl", "rb") as f:
lstm_history = pickle.load(f)
# Load GRU training history
with open("/content/experiment_gru_tuned/training_history.pkl", "rb") as f:
gru_history = pickle.load(f)
# Plot loss curves
fig, axs = plt.subplots(1, 2, figsize=(14, 5), sharey=True)
# LSTM Plot
axs[0].plot(lstm_history["train_losses"], label="Train Loss")
axs[0].plot(lstm_history["val_losses"], label="Val Loss", color="orange")
axs[0].set_title("Tuned LSTM Loss")
axs[0].set_xlabel("Epoch")
axs[0].set_ylabel("MSE Loss")
axs[0].legend()
axs[0].grid(True)
# GRU Plot
axs[1].plot(gru_history["train_losses"], label="Train Loss")
axs[1].plot(gru_history["val_losses"], label="Val Loss", color="orange")
axs[1].set_title("Tuned GRU Loss")
axs[1].set_xlabel("Epoch")
axs[1].legend()
axs[1].grid(True)
plt.suptitle("Training vs. Validation Loss Comparison: Tuned LSTM vs GRU")
plt.tight_layout()
plt.savefig("lstm_gru_loss_comparison.png")
plt.show()
2. Transformer Models
2.1 Actual vs Predictions
import pandas as pd
import matplotlib.pyplot as plt
# Load prediction files
df_vanilla = pd.read_csv("/content/experiment_transformer_vanilla/test_predictions.csv")
df_regularized = pd.read_csv("/content/experiment_transformer_regularized/test_predictions.csv")
df_final = pd.read_csv("/content/experiment_transformer_final/test_predictions.csv")
# Plot
plt.figure(figsize=(14, 5))
plt.plot(df_vanilla["Actual"], label="Actual", color='steelblue', linewidth=1)
plt.plot(df_vanilla["Predicted"], label="Vanilla Transformer", color='orange', linewidth=1)
plt.plot(df_regularized["Predicted"], label="Regularized Transformer", color='purple', linewidth=1)
plt.plot(df_final["Predicted"], label="Patch-based Transformer", color='green', linewidth=1)
plt.title("Actual vs Transformer Predictions (All Variants)")
plt.xlabel("Time Step")
plt.ylabel("Log Return")
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
2.2 Training and Validation Losses
import os
import pickle
import matplotlib.pyplot as plt
# Define paths to pkl files
paths = {
"Vanilla Transformer": "/content/experiment_transformer_vanilla/training_history.pkl",
"Regularized Transformer": "/content/experiment_transformer_regularized/training_history.pkl",
"Patch-based Transformer": "/content/experiment_transformer_final/training_history.pkl"
}
# Define distinct colors
colors = {
"Vanilla Transformer": "orange",
"Regularized Transformer": "purple",
"Patch-based Transformer": "green"
}
# Initialize dictionary to store losses
losses = {}
# Load data from each model
for model_name, path in paths.items():
with open(path, "rb") as f:
history = pickle.load(f)
losses[model_name] = {
"train": history["train_losses"],
"val": history["val_losses"]
}
# Plot side-by-side comparison
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 5), sharey=True)
# Plot training losses
for model_name in losses:
ax1.plot(losses[model_name]["train"], label=model_name, color=colors[model_name])
ax1.set_title("Training Loss Comparison (MSE)")
ax1.set_xlabel("Epoch")
ax1.set_ylabel("Loss")
ax1.legend()
ax1.grid(True)
# Plot validation losses
for model_name in losses:
ax2.plot(losses[model_name]["val"], label=model_name, color=colors[model_name])
ax2.set_title("Validation Loss Comparison (MSE)")
ax2.set_xlabel("Epoch")
ax2.legend()
ax2.grid(True)
plt.suptitle("Training vs. Validation Loss for Transformer Variants")
plt.tight_layout()
plt.show()
Model Architecture Summary¶
This project evaluates multiple deep learning architectures for financial time series forecasting. Below is a summary of the models implemented and compared:
| Model Name | Type | Key Features |
|---|---|---|
| LSTMModel | Recurrent (LSTM) | Captures long-term dependencies using memory cells and gates. |
| GRUModel | Recurrent (GRU) | A simplified version of LSTM with fewer parameters and similar performance. |
| TimeSeriesTransformer | Transformer | Vanilla Transformer with positional encoding and self-attention. |
| TransformerRegularized | Transformer | Adds LayerNorm, dropout regularization, and optional MC Dropout inference. |
| ForecastingTransformer | Patch-based Transformer | Inspired by PatchTST. Uses patch embedding, positional encoding, and global pooling. |
All models use a sliding window of the past 30, 40, or 60 days of technical indicators, depending on the configuration.
- Baseline models use `seq_len=30`.
- Tuned LSTM and GRU models, as well as the Vanilla and Regularized Transformers, use `seq_len=60`, as this yielded better results during grid search for both the LSTM and GRU models.
- The final Patch-based Transformer uses `seq_len=40`, derived from `patch_len=10` × 4 patches, following the patching design inspired by Vision Transformers (ViTs).

Monte Carlo Dropout is applied to `TransformerRegularized` and `ForecastingTransformer` for uncertainty estimation. VaR and ES are computed from the predictive distributions to assess financial risk.
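To illustrate the patching design described above, here is a minimal sketch of a patch-embedding layer (the actual `ForecastingTransformer` is defined earlier in the notebook; `d_model=64` and 5 input features are assumed values for illustration):

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split a (batch, seq_len, n_features) window into non-overlapping
    patches and project each patch to d_model, PatchTST-style."""
    def __init__(self, seq_len=40, patch_len=10, n_features=5, d_model=64):
        super().__init__()
        assert seq_len % patch_len == 0, "seq_len must be divisible by patch_len"
        self.patch_len = patch_len
        self.n_patches = seq_len // patch_len
        self.proj = nn.Linear(patch_len * n_features, d_model)

    def forward(self, x):                                      # x: (B, seq_len, F)
        b, t, f = x.shape
        x = x.reshape(b, self.n_patches, self.patch_len * f)   # (B, 4, patch_len*F)
        return self.proj(x)                                    # (B, n_patches, d_model)
```

Patching shortens the token sequence the encoder attends over (here 40 time steps become 4 tokens), which reduces attention cost and lets each token summarize a local temporal window.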
References¶
Vaswani et al. (2017) — Attention Is All You Need. Introduced the Transformer architecture used in all advanced models. https://arxiv.org/abs/1706.03762
Hochreiter & Schmidhuber (1997) — Long Short-Term Memory. Foundation for the LSTM baseline. https://www.bioinf.jku.at/publications/older/2604.pdf
Cho et al. (2014) — Gated Recurrent Unit (GRU). Basis for the GRU model. https://arxiv.org/abs/1409.1259
Gal & Ghahramani (2016) — Dropout as a Bayesian Approximation. The Monte Carlo Dropout inference is based on this work. https://arxiv.org/abs/1506.02142
Rockafellar & Uryasev (2000) — Conditional Value at Risk (CVaR). The VaR and ES computations are grounded in this risk framework. https://doi.org/10.21314/JOR.2000.038
Nie et al. (2023) — PatchTST. The patch-based Transformer is conceptually inspired by this work. https://arxiv.org/abs/2211.14730
Libraries and APIs
- Yahoo Finance API (via yfinance) – Used to fetch S&P 500 OHLCV stock data.
- PyTorch Documentation – TransformerEncoder – Used to build Transformer encoder layers.
- PyTorch Documentation – Dropout – Used in MC Dropout at inference.
- PyTorch Documentation – LayerNorm – Used for normalization in regularized and patch-based models.
- scikit-learn – Used for data normalization (`StandardScaler`) and metrics like MAE, MSE, RMSE.
- matplotlib – Used for visualizing training curves, prediction results, and uncertainty histograms.
- TensorBoard (PyTorch Integration) – Used to log training/validation loss for analysis.